LUCIANA BICALHO MAROUN

RANKING-ORIENTED REVIEW RECOMMENDATION

Dissertation presented to the Graduate Program in Computer Science of the Universidade Federal de Minas Gerais in partial fulfillment of the requirements for the degree of Master in Computer Science.

Advisor: Mirella Moura Moro
Co-Advisor: Jussara Marques de Almeida Gonçalves

Belo Horizonte
March 2016

© 2016, Luciana Bicalho Maroun. All rights reserved.

Maroun, Luciana Bicalho
M356r  Ranking-oriented review recommendation / Luciana Bicalho Maroun. — Belo Horizonte, 2016
xxvii, 124 f. : il. ; 29cm
Dissertation (Master's) — Universidade Federal de Minas Gerais — Department of Computer Science
Advisor: Mirella Moura Moro
Co-Advisor: Jussara Marques de Almeida Gonçalves
1. Computer Science – Theses. 2. Information retrieval. 3. Recommender systems. I. Advisor. II. Co-Advisor. III. Title.
CDU 519.6*73(043)

To Antônio Carlos, Valéria, Gabriel and Júlio, for making everything possible.

Acknowledgments

I thank my parents, Antônio Carlos and Valéria, for their unconditional love, dedication and support, for always believing in me, and for being my greatest examples. I will never be able to repay everything they have done for me, but I will forever try. To my brother, Gabriel, for the authenticity and lightness that inspire me, and for being a gift in my life. To Júlio, for his immeasurable love, companionship and support; for making me complete.
Words fail to describe how blessed I am to have him by my side. To Laís, for her understanding, surprising affinity and constant presence; for her pure and sincere friendship. To my grandparents, Dodora, Rômulo, Regina and Michel, for being essential pillars. To my family and Júlio's, for rooting for me. To God, for the opportunities and for the amazing people He placed in my path. I could not fail to thank Mirella Moro and Jussara Almeida, for their guidance, dedication and teachings that go beyond academia. To professors Pedro Olmo, Ana Paula Couto da Silva and Renato Assunção, for the collaboration and discussions. To Thanis, Izabela, Elverton and Osvaldo, for sharing the difficulties along the path that stretched from undergraduate studies to the Master's. To my colleagues at LBD, for their willingness to contribute and answer questions; in particular, to Michele Brandão, Sérgio Canuto, Daniel Hasan and Daniel Xavier. To Marisa Vasconcelos, for her readiness to help. To UFMG, DCC and CNPq, for their support. Finally, I thank everyone who contributed to and believed in this work.

"Life is like that: it warms and cools, tightens and then loosens, calms and then disquiets. What it wants from us is courage." (Guimarães Rosa)

Resumo

Reading online reviews before a purchase has become customary, as users rely on them to make better-grounded choices. However, the growing volume of reviews limits their effectiveness, so that many techniques try to predict their global quality. Yet helpfulness may not be uniform across users, who have different backgrounds and preferences. Hence, recommending reviews in a personalized fashion is likely to be more accurate. Under this hypothesis, a few personalized strategies for review recommendation exist which, nevertheless, have not been compared against each other.
Moreover, such techniques are typically evaluated with a regression metric, whereas a ranking metric would be more adequate, as it better represents the final result presented to users. The goal of this work is to analyze and compare specialized review recommendation methods as well as generic ones that have not been evaluated from a personalized and ranking viewpoint. We also aim to identify the properties that make a method effective for the review recommendation problem. Using a dataset from Ciao UK, the experiments conducted here consider parameter tuning, both regression and ranking metrics, and comparison through statistical tests. The experimental results indicate that simpler approaches outperform more complex ones, and that observed features, as well as evaluation under a ranking perspective, are important traits for this problem. Nonetheless, techniques that perform regression on observed data surprisingly achieve the best ranking performance: a Support Vector Regression with a linear kernel is the best approach, followed by Linear Regression. Both statistically surpass learning to rank methods, traditional recommender systems and specialized approaches. Finally, we propose an SVR-based solution that jointly optimizes a ranking loss and exploits the relation between author and reader, providing a slight but significant improvement over the original version.

Keywords: Online Reviews, Personalized Helpfulness, Recommender Systems, Top-n Recommendation, Ranking.

Abstract

Reading online reviews before a purchase has become customary, as users rely on them to make better-grounded choices. Nevertheless, the increasing volume of reviews works as a barrier to their effectiveness, so that many approaches try to predict reviews' global quality.
The perceived helpfulness, however, may not be uniform across users due to different backgrounds and preferences. Thus, recommending reviews in a personalized fashion is likely to be more accurate. Under this hypothesis, there are some personalized strategies for review recommendation that, however, have not yet been compared against each other. Also, such techniques are typically evaluated with a regression metric, whereas we argue that a ranking metric is more adequate because it better represents the final result presented to users. The goal of this work is to analyze and compare such specialized review recommenders as well as generic ones that have not been previously evaluated from a personalized and ranking viewpoint. We also aim at deriving insights about the properties of an effective predictor for the review recommendation problem. Using a dataset from Ciao UK, the experiments conducted here consider parameter tuning, both regression and ranking metrics, and comparison through statistical tests. Our experimental results indicate that simpler approaches outperform complex ones, and that observed features, as well as evaluation under a ranking perspective, are important traits for this problem. Nonetheless, techniques performing regression on observed data surprisingly obtain the best performance for ranking: Support Vector Regression (SVR) with a linear kernel is the best, followed by Linear Regression. Both statistically surpass learning to rank methods, traditional recommender systems and specialized approaches. Finally, we propose an SVR-based method that jointly optimizes a ranking loss and exploits the relation between author and reader, providing a slight but significant improvement over the original version.

Keywords: Online Reviews, Personalized Helpfulness, Recommender Systems, Top-n Recommendation, Ranking.

List of Figures

2.1 Illustration of the elements involved in a recommender system. A user
A user demonstrates preferences to items through transactions, which may be per- formed as different actions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Representation of a content-based recommender system, which relies on user and/or item profiles and a similarity measure between them. . . . . . . . . 11 2.3 Representation of a recommender system applying a collaborative filtering approach, which uses collaborative information from similar users or items. 12 2.4 Example of a user-item matrix with movies as items. Observed ratings are numeric values, whereas unknown ones are represented by a question mark. 15 2.5 Result obtained from a rating prediction recommender system, whose goal is to fill the missing values of the matrix. Predicted values, in red, are estimated in a close range of observed ratings. . . . . . . . . . . . . . . . . 16 2.6 Result obtained from a top-n recommender system. Predicted values, in red, are not necessarily in the same range of original ratings; instead, they represent relative indices. Then, a ranking of is obtained for each user. . . 16 2.7 An overview of a review recommender system. Given a reader and a product (e.g., a camera) with reviews written by authors, the system retrieves a personalized ranking of reviews that matches the reader’s interests. . . . . 18 2.8 An example of a good regression, but poor ranking prediction. When re- gression error is minimized, many inversions in user rankings may occur. . 19 3.1 Different ways of processing review data. . . . . . . . . . . . . . . . . . . . 22 4.1 Example of reviews in the list of latest ones in Ciao UK. Label 1 indicates the rating; 2, the review title; 3, the product; 4, the author’s login; 5, the review text; and 6, the overall helpfulness. . . . . . . . . . . . . . . . . . . 30 xix 4.2 Fragment of a review page in Ciao UK, in which not all users agree on the review’s helpfulness. 
Label 1 shows the author's login; 2, the rating; 3, basic author reputation; 4, the review text; 5, average helpfulness; and 6, the distribution of votes.
4.3 Example of a review's comments, which may be supplied with a vote. In this example, users complain about missing information in the review, a recurring reason for lower helpfulness votes.
4.4 Distribution of ranking sizes in the original dataset.
4.5 Distributions of activity and popularity, in log-log scale. Figure (a) contains the distribution of the number of reviews written and the number of votes received per author. Figure (b) presents the number of votes given by users, as voters. Figure (c) shows the number of votes received per review.
4.6 Distributions of evaluation values: ratings as evaluations of products in Figure (a) and votes as evaluations of reviews in Figure (b).
4.7 Scatter plots correlating user participation and mean helpfulness.
4.8 Distribution of in-degree and out-degree in the network, in log-log scale.
4.9 Subgraph of the trust network corresponding to the largest SCC.
4.10 Investigation of author consistency and different entities' conformity.
4.11 Scatter plot of CV versus number of votes for each grouping.
4.12 Distribution of the Jaccard coefficient of trustees (Figure (a)), of authors' number of trustors (Figure (b)), and of authors' number of trustees (Figure (c)).
4.13 Box plots for features in both the InfoGain and ρs top 5 (except the number of sentences), for each helpfulness level.
5.1 Illustration of a linear SVR model. The solid line represents the predictive model and dashed lines delimit tolerable deviations. Only instances whose prediction is beyond the tolerance are considered errors.
5.2 Illustration of an RSVM model. There are two classes, a positive and a negative one, and each pair corresponds to a classified instance. Any pair beyond the correct margin provides an error equal to the distance to such margin.
6.1 Preprocessing step for modeling the data to be used by the review recommender techniques. Votes are divided into splits and, for each split, features are extracted from reviews, users and interactions according to the training set available.
6.2 Prediction step for any algorithm. The input is composed of a set of feature vectors or a user-item matrix. Then, the utility function is built and applied to unseen instances, in validation or test sets.
6.3 Comparison of techniques within classes in Figures (a)-(e) and of the best of each class in Figure (f) using nDCG@p. The set of best algorithms includes LR, as it stands between SVR and the other classes in performance.
6.4 Comparison of all techniques but LTR regarding RMSE.
6.5 Comparison of techniques with nDCG@p using random splits. The best of each class is in Figure (a), and the best performer, RLFM, and collaborative filtering (CF) methods are in Figure (b). RLFM is significantly the best.
6.6 Review excerpts with different helpfulness for the reader and product.
6.7 Ranking obtained from a perfect SVR model with ε < 0.5. For integer responses, absence of error in such a model implies absence of error in the ranking.
C.1 PGM for BETF. Rectangles represent the cardinality of the given variable. Observed variables are represented by dark circles and latent ones by light ones. We use hij as yhij.
C.2 PGM for CAP. The notation is similar to that of Figure C.1.
List of Tables

4.1 Statistics of the dataset.
4.2 Statistics of the trust network.
4.3 Features defined in the literature with a brief description. Each feature is derived from an entity or relationship, represented in fragment headers.
4.4 Top 10 features with higher InfoGain.
4.5 Top 10 features with higher absolute value of ρs. Features also in the InfoGain top 10 are followed by a red circle.
5.1 Mapping of naming conventions from the original definition of the LTR and GRS classes (generic) to the review recommendation scenario (specific).
5.2 Symbols used for defining the methods.
5.3 Definition of variables used by RLFM.
5.4 Definition of variables used by BETF.
5.5 Formulas used by the BETF algorithm.
5.6 Definition of variables used by CAP.
5.7 Definition of sets used by CAP.
5.8 Formulas used by the CAP algorithm.
5.9 Summary of the techniques considered.
6.1 List of parameters with a description, evaluated range or set, and the best value for ranking (nDCG@5*) and regression (RMSE*).
6.2 Evaluation of techniques for the three metrics considered. Parameters used for RMSE are different from those of ERR and nDCG@5. For ERR and nDCG@5, the higher the better; for RMSE, the lower the better.
6.3 Result of paired t-tests represented by p-values for each class regarding all metrics.
The null hypothesis is that a pair of methods has equal means.
6.4 p-values obtained from paired t-tests of the best of each class for nDCG@5, comparing the technique in the column with the one in the row.
6.5 p-values obtained from paired t-tests of the best of each class for ERR, comparing the technique in the column with the one in the row.
6.6 p-values obtained from paired t-tests of the best of each class for RMSE, comparing the technique in the column with the one in the row.
6.7 Ranking of the best of each class according to each metric.
6.8 Prediction of review helpfulness for the considered example by the three best techniques, along with the ground truth for helpfulness.
6.9 Prediction of rankings by the three best techniques for the top 5. The ranking is obtained by sorting with predicted values, and the corresponding true value of each position is presented.
6.10 SVR performance with one set of features removed at a time. A paired t-test is performed between each new version and the original, with all features, and p-values are shown. A red down arrow represents significant worsening, and a green equality sign a statistical tie.
6.11 SVR performance for a set of features defined by a feature selection criterion, InfoGain and ρs. A paired t-test is performed between SVR with all features and with the corresponding subset, and p-values are shown. A blue up arrow is presented for significant improvement, and a green equal sign for a statistical tie.
6.12 New proposed features derived from voter-author interaction.
6.13 Variations upon SVR. A blue up arrow shows significant improvement.
6.14 List of parameters with a description, evaluated range or set, and the best value for ranking (nDCG@5*) for CoRaSVR.

Contents

Acknowledgments
Resumo
Abstract
List of Figures
List of Tables
1 Introduction
   1.1 Review Platform Scenario
   1.2 Open Challenges
   1.3 Main Contributions
   1.4 Text Organization
2 Concepts and Problem Definition
   2.1 Recommender Systems Basics
      2.1.1 Types of Recommender Systems
      2.1.2 Types of Entities
      2.1.3 Types of Feedback
      2.1.4 Incorporating Biases
      2.1.5 Recommendation Tasks
   2.2 Review Recommendation Scenario
      2.2.1 Problem Statement
   2.3 Concluding Remarks
3 Related Work
   3.1 Mechanisms for Processing Review Data
      3.1.1 Global Helpfulness Prediction
      3.1.2 Personalized Review Recommendation
      3.1.3 Review Summarization
      3.1.4 Keyword Search over Reviews
      3.1.5 Product Recommendation
   3.2 Recommendation as a Ranking Task
   3.3 Analyses of Reviews
   3.4 Concluding Remarks
4 Helpfulness Voting Dynamics: A Case Study in Ciao UK
   4.1 A Closer Look at Ciao UK
   4.2 Dataset Overview
   4.3 Participatory Behavior
   4.4 User Connections over a Trust Network
   4.5 Review Helpfulness: Universal or Individual?
   4.6 Characterization through Features
   4.7 Concluding Remarks
5 Multiple-Paradigm Techniques
   5.1 Paradigms
   5.2 Mean-based Predictors
   5.3 Regressors
   5.4 Learning to Rank Predictors
   5.5 General-Purpose Recommender Systems
   5.6 Review Recommender Systems
   5.7 Implementation Specifics
   5.8 Concluding Remarks
6 Experimental Analysis
   6.1 Experimental Design
   6.2 Parameter Tuning
   6.3 Comparison of Techniques
   6.4 Discussion of Results
      6.4.1 Ranking versus Regression Goal
      6.4.2 Observed versus Latent Features
      6.4.3 Simple versus Sophisticated Predictive Model
      6.4.4 Absent versus Present Bias
      6.4.5 Contrasting with Previous Experiments
      6.4.6 Illustrative Example
   6.5 Inspecting the SVR Model
      6.5.1 Investigating Features' Impact
      6.5.2 Contrasting SVR Versions
      6.5.3 Detailing CoRaSVR
   6.6 Concluding Remarks
7 Conclusion
   7.1 Contributions
   7.2 Future Work
Bibliography
A Assumptions about Features
B Mathematical Proofs
   B.1 Complete Log-Likelihood for CAP
   B.2 Derivatives of Expectation of Log-Likelihood for CAP
   B.3 Derivatives of Loss Function for BETF
   B.4 Derivatives of Loss Function for CoRaSVR
C Probabilistic Graphical Models of Review Recommender Systems

Chapter 1

Introduction

Online review platforms, such as Yelp [1], Amazon [2], TripAdvisor [3] and Ciao UK [4], play an important role in users' search for goods and services. To illustrate, a study conducted in the United States in 2014 points out that 88% of consumers have read online reviews before a purchase (vs. 85% in 2013) and 39% read them regularly (vs. 32% in 2013) [Anderson, 2014]. Besides, 72% of interviewees said that positive reviews increase their trust in a local business, and 88% trust online reviews as much as personal recommendations. The impact on sales was also measured quantitatively: an extra half star on Yelp raises businesses' sales by 19 percentage points [Anderson and Magruder, 2012]. Thus, online reviews also impact business owners.

[1] Yelp. http://www.yelp.com
[2] Amazon. http://www.amazon.com
[3] TripAdvisor. http://www.tripadvisor.com
[4] Ciao UK. http://www.ciao.co.uk

For consumers, finding good reviews has become very difficult as too many are being produced. In such a scenario, it is impossible to read every review, whereas not all of them have the information readers are seeking. Thus, the great volume of reviews limits their effectiveness as a source of valuable information. Reviews differ greatly in style and quality: some are detailed descriptions of positive and negative aspects; others are superficial and subjective. Besides, the same review may be considered helpful by some and not helpful by others. In fact, the concept of helpfulness is not necessarily homogeneous across users, as they have different backgrounds and preferences. In general terms, customers demand a technique for selecting useful reviews considering personal needs and preferences.

One way to meet such demand consists in defining a helpfulness index for each review and filtering the most suitable ones. Indeed, there are plenty of non-personalized solutions predicting a global helpfulness for each review, which is common to all readers [Lee and Choeh, 2014; Lu et al., 2010; O'Mahony and Smyth, 2009; Zhang and Tran, 2011]. Nonetheless, recent solutions demonstrate improvement by dealing with this problem in a personalized way, i.e., considering users' idiosyncrasies for helpfulness prediction [Moghaddam et al., 2012; Tang et al., 2013].
Other approaches may be used for processing the great number of online reviews: summarizing reviews' content [Hu and Liu, 2004; Zhan et al., 2009], allowing search over features inside reviews [Duan et al., 2013; Ganesan and Zhai, 2012] and recommending products based on such features [McAuley and Leskovec, 2013; Xu et al., 2012]. These options spare users the need to read reviews in their original format, but they may not be the best solution for a user who wants to understand a product or service in depth. Certain users might want to know exactly what others have perceived about a product and decide on their own whether it suits their needs. Filtering original reviews is then the desired solution in this case, and it is the focus of our work.

Next, we describe the context of a review platform in Section 1.1, discuss challenges in review exploitation by readers in Section 1.2, state the main contributions of our work in Section 1.3 and, finally, present the text organization in Section 1.4.

1.1 Review Platform Scenario

A review-enabled application allows users to express their opinions, through reviews, about products, businesses, locations, news, movies and many others, generically referred to as products. An online review consists of a report available online, created by a certain individual, about his or her experience with a product. Such platforms are changing the behavior of consumers, who increasingly consult online reviews before a buying decision. As a consequence, users can make more informed and grounded choices and avoid frustration after a purchase.

Such systems also provide feedback for business owners, allowing them to plan improvements. Additionally, as Simonson and Rosen [2014] emphasize, products were previously sold mainly due to brand knowledge. Nowadays, a few consumers writing online reviews is enough for many people to know whether a product meets their needs, thus enabling the rise of unknown brands.
This leads to a fairer market, where quality and not only renown is considered. There are several studies attempting to identify the motivations for a user to write a review [Christodoulides and Jevons, 2011; Dahlgren et al., 2015; Dellarocas and Narayan, 2006; Hennig-Thurau et al., 2004; Rensink, 2013]. Users may want to add value to an online community; then they write to be a part of such a community, to exert power and to benefit others. They are also inspired to contribute due to their own need for consuming reviews; therefore, they want to do their part. Another reason lies in self-enhancement, as users want to have good thoughts about themselves and their status [Dahlgren et al., 2015]. They may also feel frustrated with a purchase and write a review to release anxiety and even penalize the provider. On the other hand, a positive experience may create a desire, out of gratitude, to help the company [Rensink, 2013]. As a drawback, there may be malicious users aiming at arbitrarily promoting or penalizing products. In short, different reasons lead users to produce their own experience reports through reviews.

The elements of a review vary a little among applications. In general, a user inserts a rating for the product under evaluation, between zero and five stars for instance, and composes a text justifying the rating and including pros and cons. In some cases, reviews may be enriched with media, such as images and videos. Moreover, a user is able to interact with others by evaluating reviews through a helpfulness vote. Such a vote may be either a grade in a range, between zero and five for example, or a binary evaluation, i.e., positive or negative. The combination of all votes composes the overall helpfulness of a review, computed using some aggregating function. In some scenarios, inserting a comment about a review is also possible.
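To make the idea of an aggregating function concrete, here is a minimal sketch of two common conventions: a positive-vote ratio for binary votes and a plain mean for graded votes. The function names are our own, and real platforms may weight or filter votes differently (for example, by voter reputation), so this is illustrative rather than any site's exact formula:

```python
def ratio_helpfulness(votes):
    """Binary votes (True = positive): fraction of positive votes."""
    if not votes:
        return None  # no votes yet, helpfulness undefined
    return sum(1 for v in votes if v) / len(votes)

def mean_helpfulness(votes):
    """Graded votes (e.g., 0-5): arithmetic mean of vote values."""
    if not votes:
        return None
    return sum(votes) / len(votes)

# A review with three positive votes and one negative vote:
print(ratio_helpfulness([True, True, True, False]))  # 0.75
# A review with graded votes 4, 5 and 3:
print(mean_helpfulness([4, 5, 3]))  # 4.0
```

Both functions collapse all voters into a single global score, which is exactly the non-personalized view of helpfulness discussed in this chapter.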
Thereafter, items have an overall rating in the platform (usually the mean of their ratings) and a popularity degree given by the number of reviews; equivalently, reviews have an overall helpfulness and a popularity given by the number of votes. For example, in Yelp and TripAdvisor, the overall helpfulness is the total number of positive votes; in Amazon, it is the ratio of positive votes to the total number of votes; and in Ciao UK, it is the average of vote values, which lie in a range [5]. Users also have a profile, which may include the number of written reviews, the distribution of given ratings, the average helpfulness of written reviews, the date of registration, among several others. In fact, many features may be derived from related reviews and votes. Some applications even allow interaction among users through a social network whose links may represent either friendship, as in Yelp, or trust, as in Ciao UK.

An application allowing the evaluation of users through reviews is, therefore, an environment full of useful information that may be obtained directly or indirectly. For that reason, reviews have the potential of fostering data-driven products, applications that provide relevant information from complex automated analyses [Fu and Asorey, 2015].

[5] In Ciao UK, the arithmetic mean is not used, but a weighted mean considering users' reputations (which are not publicly disclosed). Source: http://www.ciao.co.uk/faq/rating-product-reviews,48#answer_5

1.2 Open Challenges

Although consumers depend on reviews to make decisions, searching for suitable ones is not easy. The great adherence of users submitting reviews yields an intractable amount of information with high variability in style and quality. Indeed, Yelp has more than 90 million reviews [6]; and TripAdvisor, more than 290 million, with an average of 55 reviews per business [7], both by the end of the third quarter of 2015.
On the other hand, a survey indicates that 67% of users read up to 6 reviews; 85%, up to 10; 93%, up to 20; and only the remaining 7% read more than 20 [Anderson, 2014]. Thus, whereas users are not willing to read many reviews, a lot of them are being produced, such that extracting relevant information from these extensive data becomes imperative.

As a real example, Meyers [8] explored the difficulty of investigating the quality of popular products, using the Kindle Fire as an example, which had 10,859 reviews on Amazon at the time the article was written (April 2013) [9]. Considering an optimistic scenario where reading a review demands approximately five seconds and only reviews of one and five stars are investigated, 7,208 in total, reading them all would require about 10 hours. This estimation illustrates the current impossibility of scanning all reviews and the importance of filtering those more helpful to the end user.

To mitigate this problem, many review-enabled applications allow users to assess reviews through a helpfulness evaluation (e.g., from zero to five). Still, there are many reviews without evaluation and more being produced every day. On Amazon, for instance, only 10% of reviews have at least 10 helpfulness evaluations [10]. Hence, having all of them evaluated by a representative number of users is practically impossible. Also, reviews with many positive votes are often ranked on top in a rich-get-richer effect, so that new ones are hardly considered useful [Moghaddam et al., 2012].

Several approaches [Liu et al., 2008; Lee and Choeh, 2014; Lu et al., 2010; Martin and Pu, 2014; O'Mahony and Smyth, 2009; Zhang and Tran, 2011] try to overcome the sparseness and the rich-get-richer effect in a non-personalized fashion. They predict a global helpfulness score by considering aspects such as the author's reputation and textual statistics, for example, from part-of-speech tagging and sentiment analysis. Nonetheless, users have different backgrounds and preferences.
For instance, a review about a photographic camera written by a professional photographer may be useful for another expert, but not equally so for a common user, who is more interested in basic features and ease of use. Consequently, the helpfulness of a review is unlikely to be equally perceived by all readers.

6 Yelp – About Us. http://www.yelp.com/about
7 TripAdvisor – Fact Sheet. http://www.tripadvisor.com/PressCenter-c4-Fact_Sheet.html
8 How I Wish Amazon Reviews Worked, by Dr. Peter J. Meyers. April 2013. http://moz.com/blog/how-i-wish-amazon-reviews-worked
9 The same version of the Kindle Fire contained 25,501 reviews in February 2016.
10 A Statistical Analysis of 1.2 Million Amazon Reviews, by Max Woolf. June 2014. http://minimaxir.com/2014/06/reviewing-reviews

Despite the efforts to design personalized recommendations in other domains (e.g., product recommendation [Balakrishnan and Chopra, 2012; Koren, 2008; Machanavajjhala et al., 2011]), only recent works consider personalized solutions for review recommendation [Moghaddam et al., 2012; Tang et al., 2013]. A personalized recommender system suggests new items (reviews, products, services, locations, etc.) to a certain user by predicting individual interest, in contrast to non-personalized recommendation, which is based solely on general aspects of the items, such as popularity. Furthermore, although some techniques for review recommendation exist, their experimental evaluations have not yet considered many existing techniques, nor the best configuration of each, under a unified statistical experimental design. Therefore, our goal is to compare existing approaches for review recommendation, both generic and specialized ones, divided into five paradigms: (i) mean-based predictors; (ii) general-purpose regressors; (iii) learning-to-rank methods; (iv) general-purpose recommender systems; and (v) review recommender systems, the specialized approaches. Our experiments differ from others in three ways.
First, to the best of our knowledge, we are the first to compare a rich set of techniques for review recommendation, comprising specialized solutions and other competitive personalized approaches. BETF [Moghaddam et al., 2012] and CAP [Tang et al., 2013], the specialized approaches, are not compared against each other in their original publications. Also, Linear Regression [Murphy, 2012] and Support Vector Regression (SVR) [Smola and Schölkopf, 2004] have been considered for predicting review helpfulness [Moghaddam et al., 2012; Tang et al., 2013; Vasconcelos et al., 2015], but using only non-personalized features, whereas we also apply personalized ones. Moreover, the parameters and the versions of the baseline algorithms are not clear in existing evaluations. Overall, competitive approaches have not yet been considered to their full potential, leaving unknown which solution is the best. We are also the first to apply learning-to-rank predictors and the Regression-based Latent Factor Model (RLFM), a sophisticated recommender system, to the review recommendation task.

Second, we compare the methods with statistical significance. This is accomplished by executing experiments in multiple runs: five different training and test sets, and three repetitions for non-deterministic methods due to stochastic variability. Then, our comparison provides a confidence level of 95%. The specialized approaches are non-deterministic and, even so, variability and significance are not reported in the experiments previously conducted with them [Moghaddam et al., 2012; Tang et al., 2013].

Finally, we consider a ranking metric for evaluation, since the ultimate goal is to display reviews sorted decreasingly by helpfulness. Recommendation of reviews is intrinsically a top-n recommendation task [Cremonesi et al., 2010], whose purpose is to compose a selection of items, preferentially ranked, and it should be evaluated as such.
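To illustrate the kind of 95%-confidence comparison described above, the sketch below builds a Student-t confidence interval from per-run scores. The scores are hypothetical, and the critical value is hard-coded for 4 degrees of freedom; this is a simplified sketch of the statistical reporting, not the dissertation's exact procedure.

```python
import statistics

# Hypothetical NDCG scores of one method over the five train/test splits.
runs = [0.612, 0.598, 0.620, 0.605, 0.615]

mean = statistics.mean(runs)
sd = statistics.stdev(runs)       # sample standard deviation
n = len(runs)
t95 = 2.776                       # Student's t critical value, df = 4, 95% two-sided
half_width = t95 * sd / n ** 0.5  # half-width of the 95% confidence interval

print(f"mean = {mean:.4f}, 95% CI = [{mean - half_width:.4f}, {mean + half_width:.4f}]")
```

Two methods whose intervals do not overlap can then be reported as significantly different at the 95% level.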
Evaluating recommender systems through ranking metrics is not a new proposal [Balakrishnan and Chopra, 2012; Christakopoulou and Banerjee, 2015; Cremonesi et al., 2010; Koren, 2008], yet regression metrics are still common [Koren et al., 2009; Moghaddam et al., 2012; Tang et al., 2013]. However, such metrics suffer from the following drawbacks: they aim at the exact prediction of evaluations, thus disagreeing with the purpose of top-n recommender systems, for which a good ordering of items is preferred; and they consider the prediction of all relevance levels equally important, whereas higher ones should have priority, as the end user is mainly interested in them [Christakopoulou and Banerjee, 2015]. Exact prediction is too restrictive for this problem and, when aiming at such a goal, a perfect model is hardly ever built. The resulting errors may produce an ineffective ordering, since ordering is not the optimization objective. Indeed, absolute predictions are not even displayed to the end user; they are only used for ranking. By using a ranking-oriented definition and metric, the focus lies only on what matters: a well-sorted selection of items.

1.3 Main Contributions

Given the great volume of reviews and the diverse perception of helpfulness by users, the purpose of this work is to evaluate existing approaches and investigate key aspects for designing a review recommender system with a ranking goal. Specifically, we evaluate current solutions for review recommendation with a ranking perspective and statistical tests, highlight the properties of effective solutions, pursue extensions and define promising future work. Overall, our contributions are summarized as follows.

(i) Evaluating techniques from multiple paradigms on a real dataset. We compare a rich set of techniques for review recommendation from five different paradigms. We describe such techniques in depth, allowing the detection of the causes behind the experimental results.
We consider a real dataset and investigate helpfulness voting patterns to substantiate experiments and solutions. Our experimental design is unified and fills a gap left by previous experiments on this problem. The techniques whose source code was not found are implemented from scratch and made publicly available.

(ii) Modeling and evaluating review recommendation from a ranking perspective. Previous techniques defined the problem as rating prediction, whose goal is to predict absolute values of helpfulness. However, the final result presented to users is a ranking of reviews, so absolute predictions are unnecessary. Therefore, we focus on a ranking-oriented perspective of the problem and also contrast it with the commonly accepted regression goal.

(iii) Identifying key design aspects and improving upon the state of the art. By comparing techniques from multiple paradigms and with very different designs, we are able to identify the properties that perform well for the review recommendation problem. Additionally, we define new features and a new technique, Combined Ranking and SVR (CoRaSVR), and both slightly but significantly outperform the state of the art in review recommendation.

1.4 Text Organization

The remainder of this work is organized as follows. Chapter 2 defines basic concepts and states the problem. Chapter 3 addresses related work. Chapter 4 presents the dataset used and investigates key properties of the review recommendation task. Chapter 5 describes the techniques considered. Chapter 6 presents the statistical experimental design and the evaluation of the techniques regarding parameter tuning, performance comparison, and extensions. Lastly, Chapter 7 addresses conclusions and future work.

Chapter 2
Concepts and Problem Definition

Research on recommender systems is relatively recent, having become an individual area of study in the mid-1990s [Ricci et al., 2010].
Great attention was drawn to this research topic by a Netflix competition, which awarded a million dollars to those who improved the performance of Netflix's movie recommendation by 10% [Koren et al., 2009].

The motivations of a service provider behind a recommender system lie in several aspects. Selling more, gaining user loyalty, improving user satisfaction and understanding users' needs are examples of providers' interests. Consumers are also driven to participate in order to find good items that would otherwise be difficult to encounter, to improve their profiles and receive more accurate suggestions, and to express ideas, help and influence others [Ricci et al., 2010]. In short, recommender systems may benefit both service providers and consumers.

In this chapter, we first introduce basic concepts related to recommender systems, their dynamics and types in Section 2.1. Then, we present the specific case of review recommendation and state the problem in Section 2.2. Finally, we address concluding remarks in Section 2.3.

2.1 Recommender Systems Basics

The initial development of recommender systems was built upon the observation that most people rely on recommendations from peers and experts in daily decision-making [Ricci et al., 2010]. The simplest recommender technique suggests the most popular items to a user, in a non-personalized fashion. This practice is reasonable since, for an unknown user, the best solution lies in recommending for an average person. Indeed, the most popular items are more likely to be enjoyed than a random one.

Figure 2.1: Illustration of the elements involved in a recommender system. A user demonstrates preferences for items through transactions (e.g., buy, click, like, share, review), which may be performed as different actions, either as implicit or as explicit feedback.
Examples of recommendations based on popularity are the top 10 best-selling books, the most listened-to videos, and so on. In spite of that, the majority of recommendation approaches provide personalized suggestions, trying to surpass prediction based on popularity [Ricci et al., 2010]. In fact, the path to improving recommender systems is precisely overcoming the baseline popularity prediction through personalization, i.e., suggesting items, not necessarily popular ones, that the user likes.

Figure 2.1 presents the three types of elements related to a recommendation problem: users, items and transactions [Ricci et al., 2010]. A user is the entity that recommendation results are intended for. His or her preferences regarding liking and disliking items are either individually entered or collected during navigation. An item is a general term indicating the object that the system suggests to users. It may correspond to products to buy, places to visit and books to read, to name a few. Although in a review platform products and services might be the items for recommendation, we instead focus on the recommendation of reviews as items. At last, a transaction is an interaction performed under a certain context between a user and an item. It can be acquired explicitly, when users manually assign evaluations to items, or implicitly, derived from users' actions accomplished with other purposes.

The applicability of many recommender approaches is broad thanks to an abstraction of user, item and transaction that spans many scenarios. However, information regarding a particular environment, with specific entity types, might considerably improve performance, since such a scenario may present special patterns and dynamics.

Figure 2.2: Representation of a content-based recommender system, which relies on user and/or item profiles and a similarity measure between them.
2.1.1 Types of Recommender Systems

Recommender systems are traditionally divided into two types: content-based and collaborative filtering [Lu et al., 2014]. A content-based recommender suggests items to a user based on the kind of items the user liked in the past, represented by observed features. In turn, a collaborative filtering method suggests items liked by similar users, or items similar to the ones previously liked, such that the similarity is based on ratings. When both item profiles and collaborative information are taken into account, a hybrid recommender system is composed, which tends to improve performance by mitigating the drawbacks of each type [Ricci et al., 2010].

Specifically, a content-based recommender system suggests new items to a user based on the properties of the ones previously liked. The recommendation process is represented in Figure 2.2. Item profiles are composed of a structured set of observed features, for example, category, brand, material, and so on. Users are individually modeled through profiles by aggregating their preferences on previously evaluated items. Such profiles commonly model how much a user values each item feature and may be obtained using a heuristic or a learning method [Ricci et al., 2010]. A content-based system assumes that a user who liked certain items in the past will like similar items in the future. However, such systems lack diversity when users do not evaluate many different items, and they are not able to recommend to a person who has not evaluated anything.

Figure 2.3: Representation of a recommender system applying a collaborative filtering approach, which uses collaborative information from similar users or items.

Building item profiles requires domain expertise, although there are well-known guidelines, especially borrowed from other traditional research spheres [Ricci et al., 2010].
For example, drawing from Information Retrieval, items composed of text, such as web pages, news articles and even reviews, may be represented using a vector space model. Cosine is the typical similarity measure in this scenario [Ricci et al., 2010]. Inspired by Artificial Intelligence, content-based systems may use past knowledge from a user's evaluations to derive prediction rules for new items. For example, a classifier may be built for each user in order to predict whether he or she will like an item or not [Ricci et al., 2010].

On an opposite track, collaborative filtering techniques rely on user-user or item-item similarity for recommendation, as Figure 2.3 outlines. A vector of given or received ratings represents users and/or items. Such an abstraction is domain-independent and may be applied indistinctly to all scenarios, with any type of users, items and transactions. The basic assumption of these systems is that two users who had similar preferences in the past will share the same taste in the future; the same holds for two items that were liked by a representative share of common users.

Collaborative filtering is further specialized into memory-based or model-based solutions. A memory-based approach finds similar users or items relative to the user-item pair whose preference is to be predicted. The weighted average of the ratings of similar users who rated the item (user-based) or the weighted average of the ratings of similar items rated by the user (item-based) is used for prediction. They are also called neighborhood-based methods, as they predict based on close entities [Ricci et al., 2010]. A model-based approach, on the other hand, transforms the matrix of ratings into a model. Past ratings are not used directly, but rather for building representations of users and items within the model.

There are other, more fine-grained types of recommender systems not described here.
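The vector space model with cosine similarity mentioned above can be sketched with plain term-frequency vectors. The review texts are hypothetical, and a real system would typically use TF-IDF weights rather than raw counts.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words (term-frequency) vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Two hypothetical review texts as term-frequency vectors.
r1 = Counter("great camera with great lens".split())
r2 = Counter("great lens but heavy camera".split())
print(cosine(r1, r2))
```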
They are: demographic systems, which rely on demographic profiles of users; knowledge-based systems, which use specific domain knowledge of how item features meet user needs; and social recommender systems, based on information from friendships and other social networks [Ricci et al., 2010].

2.1.2 Types of Entities

Cold-start entities are a habitual challenge, mainly in collaborative filtering solutions [Agarwal and Chen, 2009; Koren et al., 2009]. A cold-start refers to a user or an item whose past transactions are scarce or not available, i.e., they are new or unknown to the system, whereas the opposite is called warm-start. Under this circumstance, finding similar entities in collaborative filtering systems is difficult, and cold-start users do not have a representative profile in content-based solutions. The system may completely fail to provide recommendations for a cold-start. A hybrid technique arises as a great option for overcoming cold-starts by using the best solution under each scenario. For example, a content-based technique may be used for predicting cold-start items and a collaborative filtering one for warm-start items.

2.1.3 Types of Feedback

User preferences may be obtained through explicit or implicit feedback expressed by users [Amatriain et al., 2009]. Explicit feedback is inserted manually and restricted to cases when users feel motivated to contribute. Implicit ratings are derived from user behavior carried out with another intention, e.g., from a click, the duration of the visualization of a web page, a purchase, a share with friends, and so on, as illustrated in Figure 2.1.

However, explicit ratings suffer from user inconsistencies named natural noise [Amatriain et al., 2009; Parra et al., 2011]. Distinct motivations may lead to different evaluation criteria, as well as insertion under distinct contexts, e.g., represented by time and location. For example, a user on vacation may tend to give higher ratings than after a stressful day of work.
Thus, an effective method should be able to overcome noise.

On the other hand, implicit feedback is more difficult to relate to user preferences, since it is derived indirectly, working as a heuristic. Nonetheless, in certain scenarios, it is the only source available due to the user overhead of providing explicit opinions [Parra et al., 2011]. Implicit feedback is typically related to a positive interest, such as clicking on a web page or viewing it for long, whereas negative opinions are hard to derive [Hu et al., 2008]. Thus, combining both types of feedback is usually better for mitigating the drawbacks of each one.

2.1.4 Incorporating Biases

Typically, rating values are partially explained by users' and items' tendencies of giving or receiving higher or lower evaluations, which is called bias [Ricci et al., 2010]. An entity's bias is the average deviation of its ratings from the overall mean. A baseline predictor may be composed by using only bias variables, as follows:

τ(u, i) = µ + b_u + b_i,    (2.1)

where τ is the predictive function, u is a user, i is an item, µ is the mean rating, b_u is the user bias, and b_i is the item bias. The last two variables may be learned by an optimization method, such as stochastic gradient descent [Koren et al., 2009]. Alternatively, such variables may be computed statically as follows:

b_u = Σ_{i ∈ I_u} (R_{ui} − µ) / (|I_u| + λ_1),    (2.2)

b_i = Σ_{u ∈ S_i} (R_{ui} − µ − b_u) / (|S_i| + λ_2),    (2.3)

where I_u is the set of items evaluated by user u, S_i is the set of users who evaluated item i, R_{ui} is the rating given by user u to item i, and λ_1 and λ_2 are regularization parameters [Ricci et al., 2010].

Using such a baseline, the residual ratings unexplained by bias, obtained by subtracting the bias baseline prediction, may be fitted using a second method. Thus, only the rating portion related to user-item interaction is adjusted by the second method.
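The static bias baseline of Equations (2.1)-(2.3) can be sketched directly. The ratings and the values of λ1 and λ2 below are hypothetical; this is a minimal illustration of the formulas, not a tuned predictor.

```python
from collections import defaultdict

ratings = {          # (user, item) -> rating; hypothetical data
    ("john", "r1"): 4, ("john", "r2"): 5,
    ("alice", "r1"): 3, ("alice", "r3"): 5,
    ("bob", "r2"): 2, ("bob", "r3"): 1,
}
lam1, lam2 = 5.0, 5.0                       # regularization parameters

mu = sum(ratings.values()) / len(ratings)   # overall mean rating

by_user, by_item = defaultdict(list), defaultdict(list)
for (u, i), r in ratings.items():
    by_user[u].append(r)
    by_item[i].append((u, r))

# Eq. (2.2): user bias as the regularized average deviation from the mean.
b_u = {u: sum(r - mu for r in rs) / (len(rs) + lam1) for u, rs in by_user.items()}

# Eq. (2.3): item bias, discounting the user bias first.
b_i = {i: sum(r - mu - b_u[u] for u, r in urs) / (len(urs) + lam2)
       for i, urs in by_item.items()}

def tau(u, i):
    """Eq. (2.1): baseline prediction for user u and item i."""
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)

print(tau("john", "r3"))
```

Note that the regularization terms shrink the biases of entities with few ratings toward zero, which is exactly the cold-start-friendly behavior one wants from a baseline.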
2.1.5 Recommendation Tasks

Recommender system techniques typically rely on existing labeled data, the training set, which contains historical examples of true labels for the level of interest of users in items. Using this set, and assuming that past patterns are repeated in the future, a predictor is built as a function that, given an input user, returns a set of suggested items. Parameters related to predictors (e.g., coefficients, latent factors, weights) are typically adjusted to optimize prediction on the training set. Then, unseen data is used for the evaluation of the method, designated the test set. Sometimes, a portion of the training set is removed from the learning process and used for parameter tuning, named the validation set.

Figure 2.4: Example of a user-item matrix with movies as items (Inception, The Matrix, Casablanca, Before Sunset) and John, Alice, Bob and Mary as users. Observed ratings are numeric values, whereas unknown ones are represented by a question mark.

A user-item matrix usually represents the past ratings, which are in the training set. Figure 2.4 shows an example, with users represented in rows and items (movies, in the example) in columns. To exemplify, user Bob gave a rating of 3 to the item Inception. In general, if user i gives rating y to item j, then R_ij = y; if user i does not rate item j, then we represent R_ij = ?.

A recommendation problem may be defined according to two perspectives: (i) the rating prediction task, aiming to predict the exact level of interest of a user in an item; or (ii) the top-n recommendation task, aiming to select the n items most preferred by a user [Deshpande and Karypis, 2004].
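A user-item matrix like the one in Figure 2.4 can be represented with NaN standing for the unknown cells. The concrete values below are hypothetical, except Bob's rating of 3 for Inception, which is mentioned in the text.

```python
import math

NA = math.nan  # unknown rating, the "?" cells of the matrix

# Rows are users, columns are items, as in Figure 2.4.
items = ["Inception", "The Matrix", "Casablanca", "Before Sunset"]
R = {
    "John":  [4,  5,  NA, 3],
    "Alice": [NA, 5,  4,  NA],
    "Bob":   [3,  NA, 2,  NA],
    "Mary":  [5,  NA, NA, 4],
}

def rating(user, item):
    """Return R_ij, or NaN when user i has not rated item j."""
    return R[user][items.index(item)]

print(rating("Bob", "Inception"))               # → 3
print(math.isnan(rating("Bob", "The Matrix")))  # → True
```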
Although the majority of traditional strategies focus on the former definition [Koren et al., 2009; McAuley and Leskovec, 2013; Moghaddam et al., 2012; Tang et al., 2013; Xu et al., 2012], the latter has recently gained attention [Balakrishnan and Chopra, 2012; Bidart et al., 2014; Christakopoulou and Banerjee, 2015; Cremonesi et al., 2010; Koren, 2008], since many applications ultimately rank items using the predicted ratings.

Specifically, the purpose of a rating prediction task is to fill the missing values in the user-item matrix, that is, the cells in which R_ij = ?. Figure 2.5 shows an example of such predictions. Consequently, the predicted values have to be on a scale close to that of the true ratings, for example, from zero to five. In this task, the closer the predicted value to the true one, the better. Then, evaluating the proximity of both values is necessary. This is accomplished by regression error metrics such as the Root Mean Squared Error (RMSE) or the Mean Absolute Error (MAE) [Hyndman and Koehler, 2006].

Figure 2.5: Result obtained from a rating prediction recommender system, whose goal is to fill the missing values of the matrix. Predicted values are estimated in a range close to the observed ratings.

Figure 2.6: Result obtained from a top-n recommender system. Predicted values are not necessarily in the same range as the original ratings; instead, they represent relative indices. Then, a ranking is obtained for each user.
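The two regression metrics just mentioned can be sketched as follows, on hypothetical true and predicted rating values.

```python
import math

def rmse(true, pred):
    """Root Mean Squared Error between true and predicted ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))

def mae(true, pred):
    """Mean Absolute Error between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

true = [4, 5, 3, 1]
pred = [3.5, 4.0, 3.7, 2.1]
print(rmse(true, pred), mae(true, pred))
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two metrics can rank the same pair of predictors differently.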
In turn, a top-n recommendation task does not focus on filling the matrix with exact values; instead, relevance values in any range are used just to allow the selection and ranking of the top-n items. Figure 2.6 demonstrates relevance estimation with a simple example. Regression metrics do not make sense in this scenario; only classification metrics, which evaluate the selection, such as precision and recall [Powers, 2007], and ranking metrics, which evaluate the selection and the ranking simultaneously, such as the Normalized Discounted Cumulative Gain at position p (NDCG@p) [Burges et al., 2005] and the Mean Average Precision at p (MAP@p) [Yue et al., 2007], are suitable.

The purpose of a recommender system is often to compose a ranked selection of the best items for a user. Consequently, optimality in a top-n task is more relevant to this end and, thus, we model review recommendation as such.

2.2 Review Recommendation Scenario

In a website of online reviews, a user may perform different roles. As an author, the user writes a review about a product (we use product as a naming convention, but it may be any object in a review platform). As a reader, the user simply reads through the existing reviews of a product. Then, as a voter, the user evaluates the helpfulness of existing reviews. Note that the same individual may act both as author and as voter. Both profiles may be used to gather more information about a user for recommendation: the current author may have a profile from previously acting as a voter, and likewise the current voter from having performed the author role before.

Specifically, a review is composed of a text (describing features of the product, e.g., the experience of usage, pros and cons) and a rating (grading how much the author liked the product, e.g., 0 to 5 stars). A rating differs from a vote, the grade given by a voter to a review.
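NDCG@p can be sketched as below. Formulations vary (some use a 2^rel − 1 gain instead of the raw relevance); this sketch uses the raw-relevance variant with the usual log2 position discount, on hypothetical graded relevance values.

```python
import math

def dcg(rels, p):
    """Discounted Cumulative Gain of the top-p relevance values, in ranked order."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:p]))

def ndcg(rels, p):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(rels, reverse=True), p)
    return dcg(rels, p) / ideal if ideal else 0.0

# Relevance of the reviews in the order the system ranked them.
ranked_rels = [3, 2, 3, 0, 1]
print(ndcg(ranked_rels, 5))
```

Because of the normalization, NDCG@p is 1.0 exactly when the system's ordering matches a descending ordering of the true relevances, regardless of the absolute values predicted.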
Formally, this scenario considers a set of products P = {p_1, p_2, ..., p_k}, a set of users U = {u_1, u_2, ..., u_l}, a set of reviews R = {r_1, r_2, ..., r_m} and a set of helpfulness votes H = {h_1, h_2, ..., h_n}. We define two functions to state the relationships between the elements of those sets: π_r : R → P × U, such that π_r(r) = (p, a) states that review r was written by author a about product p; and π_h : H → R × U, such that π_h(h) = (r, v) defines that helpfulness vote h was given by voter v to review r. We emphasize that both author a and voter v belong to the set U.

2.2.1 Problem Statement

Optimally, a review recommender system provides a ranking of reviews in descending order of helpfulness for a given user and product, as shown in Figure 2.7. Thereafter, whenever a reader accesses a product, the most helpful reviews are on top, eliminating the burden of manually looking for suitable ones.

Figure 2.7: An overview of a review recommender system. Given a reader and a product (e.g., a camera) with reviews written by authors, the system retrieves a personalized ranking of reviews that matches the reader's interests.

The goal of a recommender system is to compose a utility function1 [Adomavicius and Tuzhilin, 2005], restricted here to the specific case of review recommendation. Accordingly, the purpose of a review recommender system is to derive a utility function τ : U × R → R such that, given an input (u, r), it measures the relative helpfulness of review r for a user u. Fitting τ requires a training set {P_0, U_0, R_0, H_0}, with P_0 ⊆ P, U_0 ⊆ U, R_0 ⊆ R, H_0 ⊆ H, unless a heuristic without a learning step is used. Users in the training set are further divided into those who acted as voters, V_0, and those who acted as authors, A_0, such that V_0 ∪ A_0 = U_0.
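The two mapping functions of the formal model can be represented as plain dictionaries (pi_r and pi_h in the code), which also makes it easy to recover the set of reviews of a product. All identifiers below are hypothetical.

```python
pi_r = {                      # review -> (product, author)
    "r1": ("camera", "alice"),
    "r2": ("camera", "bob"),
    "r3": ("phone", "alice"),
}
pi_h = {                      # helpfulness vote -> (review, voter)
    "h1": ("r1", "carol"),
    "h2": ("r1", "dave"),
    "h3": ("r2", "carol"),
}

def reviews_of(product):
    """The reviews written about a given product."""
    return sorted(r for r, (p, _a) in pi_r.items() if p == product)

def votes_of(review):
    """The (vote, voter) pairs received by a given review."""
    return sorted((h, v) for h, (r, v) in pi_h.items() if r == review)

print(reviews_of("camera"))   # → ['r1', 'r2']
print(votes_of("r1"))         # → [('h1', 'carol'), ('h2', 'dave')]
```

Note that "alice" appears as an author here while "carol" appears as a voter; as the text emphasizes, both belong to the same user set U, and the same identifier could occur in both roles.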
After the utility function is built, given a product p and a reader u, we may obtain the set of reviews associated with product p, defined as R_p = {r ∈ R | ∃a ∈ U : π_r(r) = (p, a)}, and derive a ranking of the reviews r ∈ R_p in decreasing order of τ(u, r). The utility function is optimal when the rankings are generated in decreasing order of the ground-truth helpfulness given by the users.

Stating review recommendation as a top-n task differs from most prior analyses of this problem. Past works focused on predicting absolute values of review helpfulness, i.e., they considered review recommendation as a rating prediction task [Moghaddam et al., 2012; Tang et al., 2013]. Nonetheless, helpfulness predictions are usually not presented to the user; thus, such an option is too restrictive. Reviews are instead selected and ranked, which only requires learning their relative positioning.

1 We use the term utility to indicate the user interest in a generic recommendation task, and helpfulness to refer to the specific opinion of users about reviews.

Figure 2.8: An example of a good regression but a poor ranking prediction. Even when the regression error is minimized, many inversions in the user rankings may occur.

To elucidate the drawbacks of a rating prediction perspective for review recommendation, Figure 2.8 shows an example of a prediction with a good regression performance that incurs a bad ranking performance. A regression approach aims at exact prediction, so that the shorter the distance between the predicted value and the true one, the better.
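The mismatch between low regression error and correct ordering can be sketched with a toy example, using hypothetical values: a prediction with small absolute errors can still swap the pair of reviews.

```python
true = [5, 4]                 # the user prefers the first review
pred = [4.4, 4.5]             # small absolute errors (0.6 and 0.5) ...

rmse = (sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)) ** 0.5
order_true = sorted(range(2), key=lambda i: -true[i])
order_pred = sorted(range(2), key=lambda i: -pred[i])

print(round(rmse, 3))            # low regression error
print(order_true == order_pred)  # ... but the pair is inverted → False
```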
A perfect regression would result in a perfect ranking, but such perfection is very hard to achieve. Therefore, the existing errors may produce incorrect orderings. In the example, we observe two types of failure: (i) a medium regression error over all items, in the case of user Alice, which caused an inversion, i.e., a swapped pair of reviews; and (ii) a low regression error for one item at the cost of a high regression error for another, as represented by Mary's situation, also promoting an inversion. In general terms, perfection in the rating prediction task is hard to achieve. Thus, when trying to optimize absolute values, effort is spent on a useless task. Furthermore, by not focusing on relative positioning, low ranking performance may occur.

2.3 Concluding Remarks

Recommender systems have broad applicability, as anything may be recommended to a user: products, locations, services, friends, activities, news, and many others. We consider the specific recommendation of reviews, a not so typical task in a review platform, where more often the products described by the reviews are recommended. As such, we borrow several definitions and the overall context of recommender systems.

They are mainly classified as content-based or collaborative filtering, and involve two types of users and items: cold-start and warm-start. Cold-start represents a challenge, especially for collaborative filtering approaches, which have no recommendation rule in this setting. A recommender predictor is typically supervised, learning from a previous set of users' preferences. Such preferences may be gathered either implicitly or explicitly, the first suffering from inference errors and the second from noise. We use a dataset with explicit ratings; thus, an effective predictor for our task has to overcome such noise.

The research on recommender systems is slowly moving the perspective from rating-oriented to ranking-oriented.
In fact, in many cases the goal is to produce a ranking of items, instead of an exact prediction of ratings. We adopt a top-n perspective in our work, evaluating all solutions using a ranking metric, besides a regression one. The problem discussed in this work is, then, defined as ranking the reviews in decreasing order of helpfulness for a given reader and product. The product restricts the set of reviews in the ranking, while the user implies the right ordering.

Chapter 3
Related Work

Related work includes different strategies for dealing with the information overload induced by reviews, existing approaches for ranking-oriented recommendation, and analyses of the dynamics underlying a review application. Regarding the first topic (Section 3.1), there are proposals that predict the global helpfulness of reviews (Section 3.1.1). Such approaches rely on the hypothesis that helpfulness is the same for all users, which may not hold in certain situations. On the other hand, there are also some personalized approaches to review recommendation (Section 3.1.2). Besides trying to predict a helpfulness index, there are other ways of extracting useful information from reviews, for instance, by composing a summary of the main features related to a product (Section 3.1.3). Extracted features may additionally be explored by allowing users to search for a suitable product (Section 3.1.4). A more typical recommendation problem in review platforms involves predicting the preferences of users for products (Section 3.1.5). With respect to the remaining topics, research on recommender systems has recently focused on changing the problem definition to address the more usual purpose of ranking (Section 3.2), although not for the specific case of review recommendation. Finally, there are works analyzing the environment of review platforms, investigating the process of writing and voting on reviews (Section 3.3).
3.1 Mechanisms for Processing Review Data

Figure 3.1 shows the different strategies of information extraction from reviews and addresses the trade-off between abstraction level and access to information. Given a target product, its reviews and user preferences, a review recommendation approach highlights the most important reviews, as a selection and/or a ranking. If a non-personalized recommendation is performed, user preferences are not considered.

Figure 3.1: Different ways of processing review data.

Review summarization receives as input a product with its reviews, and then derives aggregated information from them, i.e., a summary. Keyword search processes reviews from all products, possibly beforehand, and finds matching products, given a set of keywords. Finally, product recommendation applies user preferences to discover matching products, deriving a set of recommended ones. The last strategy typically uses just ratings, but review text may be used to augment information about users' opinions and product features. Although reading reviews requires a greater overhead, review recommendation may be useful for individuals investigating a product, especially if they already have one in mind and want to obtain further information.

3.1.1 Global Helpfulness Prediction

Many works propose a method of helpfulness prediction in a non-personalized fashion, using: regression, aiming at a specific real value [Kim et al., 2006; Lee and Choeh, 2014; Lu et al., 2010]; classification, by dividing helpfulness into non-ordinal categories [O'Mahony and Smyth, 2009; Zhang and Tran, 2011]; or comparison of the performance of both [Liu et al., 2008; Martin and Pu, 2014].
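A minimal sketch of such a regression-based pipeline follows, with a hypothetical, deliberately shallow feature set (word count and sentiment-word ratios) and illustrative weights standing in for the coefficients these works learn from labeled helpfulness votes:

```python
def text_features(review):
    """Hypothetical shallow features of the kind used by global helpfulness
    predictors: review length and ratios of sentiment-bearing words."""
    words = review.lower().split()
    positive = {"great", "excellent", "good", "love"}   # toy lexicons
    negative = {"bad", "poor", "terrible", "broken"}
    n = len(words) or 1
    return [
        1.0,                                    # bias term
        len(words),                             # length in words
        sum(w in positive for w in words) / n,  # positive-word ratio
        sum(w in negative for w in words) / n,  # negative-word ratio
    ]

def predict_helpfulness(review, weights):
    """Linear regression prediction: dot product of features and weights."""
    return sum(f * w for f, w in zip(text_features(review), weights))

# Illustrative weights; in the surveyed works these would be fitted by
# linear regression or SVR on observed helpfulness votes.
weights = [2.0, 0.01, 3.0, -2.0]
print(round(predict_helpfulness("Great camera, excellent battery life", weights), 2))  # 3.25
```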
Initial works consider reviews as standalone documents and extract features from the text to derive the helpfulness prediction. Examples of features are: number of words, ratio of part-of-speech tags (such as nouns and adjectives), proportion of negative and positive words after sentiment analysis, subjectivity level, and readability. Subsequent works include other information, considering that a user is the author of a review. Then, features are derived from author reputation and the author's place in the underlying social network, when available. Number of reviews, average helpfulness received so far and the number of third parties who trust the author are some possible author features. Under the regression hypothesis, different algorithms are applied, such as Linear Regression [Lu et al., 2010], Support Vector Regression [Kim et al., 2006; Martin and Pu, 2014], Neural Networks [Lee and Choeh, 2014], Radial Basis Functions [Liu et al., 2008], and Random Forest Regressors [Martin and Pu, 2014]. Classifiers are also explored using different methods: O'Mahony and Smyth [2009] investigate JRip, Decision Tree and Naive Bayes, with JRip performing best in their experiments; and Zhang and Tran [2011] compare Naive Bayes, SVM and a new method based on Information Gain, with the latter presenting the best performance. All these works assume that there is a global helpfulness index, independent of any particular user.

3.1.2 Personalized Review Recommendation

Instead of using a global index, a personalized recommendation predicts reviews' helpfulness differently for each user. A few works [Moghaddam et al., 2012; Tang et al., 2013] consider user idiosyncrasy and obtain results superior to those that disregard it. Moghaddam et al. [2012] address the need to recommend reviews and propose four increasingly sophisticated collaborative filtering approaches using latent factor models.
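The simplest of these latent factor models predicts a reader's helpfulness vote for a review as the dot product of their latent vectors. A toy sketch with made-up factors follows (in the actual methods these vectors are learned by minimizing squared error over observed votes):

```python
# Toy latent factors (K = 3), purely illustrative values.
reader_factors = {"alice": [0.9, 0.1, 0.5]}
review_factors = {"review_42": [1.2, 0.4, 0.8], "review_7": [0.2, 1.5, 0.1]}

def predict(reader, review):
    """MF-style prediction: dot product of reader and review latent vectors."""
    p, q = reader_factors[reader], review_factors[review]
    return sum(pi * qi for pi, qi in zip(p, q))

# Rank a product's reviews for Alice by predicted helpfulness.
ranking = sorted(review_factors, key=lambda r: predict("alice", r), reverse=True)
print(ranking)  # ['review_42', 'review_7']
```

The tensor variants described next keep this same inner-product form, but replace the single review dimension by author and product dimensions.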
Matrix Factorization (MF), a well-known model-based algorithm in the Recommender Systems field, is the first one and predicts helpfulness as the dot product of the reader's and the review's latent factors. The second version, Tensor Factorization (TF), replaces the review dimension by its underlying author and product, turning the rating matrix into a tensor to be factorized into reader, author and product latent matrices. The next improvement, Extended Tensor Factorization (ETF), consists in adding a new term to the objective formula such that ratings are explained by the inner product of author and product latent factors. Finally, they propose unBiased Extended Tensor Factorization (BETF), which accounts for reader and review biases in the helpfulness value, and for author and product biases in the rating value. The four described methods are compared with mean-based predictors, Linear Regression, Support Vector Regression and User-Based Collaborative Filtering. The proposed approaches are better than all the others and BETF presents the best performance among them, according to Root Mean Squared Error (RMSE). They use 90% of the data as training set and apply a random, hence not chronological, split. This setting may not be reproducible in practice, as it requires a large training size and disregards time constraints. For instance, a lower ratio of cold-starts occurs, favoring the proposed methods. Besides, the experiments are not reported with a confidence level. In a similar personalized fashion, Tang et al. [2013] propose Context-Aware Review Helpfulness Rating Prediction (CAP), a method based on latent variables, but also using observed information from social and content data. User behavior is divided into four cases: connecting with others (social network), writing reviews, evaluating reviews and evaluating products.
Each behavior is a source of information for review helpfulness prediction: connected users tend to give higher grades to one another; users with similar product evaluations also tend to give higher helpfulness votes to each other; and author, reader and review properties impact the helpfulness. In this approach, the final helpfulness is given by a sum of latent factors, each one with a distribution centered on the regression of observed features. Regression coefficients and latent factors are fitted using a Monte Carlo Expectation Maximization algorithm. CAP is compared with mean-based predictions, Linear Regression, Matrix Factorization and a joint Matrix Factorization and Neighborhood-Based Collaborative Filtering. CAP reveals a better performance under the RMSE metric. The experiments are conducted by chronologically dividing the dataset into 50% for the training set and 50% for the test set. The experimental results, however, are not reported with statistical confidence. Additionally, CAP is not compared with the other specialized method, BETF.

3.1.3 Review Summarization

Composing reviews' summaries is another way of extracting useful information from reviews, which avoids manual investigation. Solutions of this type aim at concentrating all important data into a small piece of information, since many reviews repeat the same features about a product. Existing works compose summaries as: an inverted index from features to reviews, with sentiments attached [Hu and Liu, 2004; Zhan et al., 2009]; a tag cloud of frequent words and expressions [Wang et al., 2014]; and a geographical visualization of the polarity of features [Bjørkelund et al., 2012]. In general, works in this scope process reviews' texts performing feature extraction and then sentiment analysis, which associates a polarity to each feature. Such methods may benefit from a ranking of reviews per user: summaries may be obtained by prioritizing helpful reviews in a personalized manner.
3.1.4 Keyword Search over Reviews

Reviews provide an enormous source of objective and subjective features not available in a product's description. Besides, such features may be tagged with overall positive and negative sentiments derived from reviews. Then, users interested in a specific set of features may perform keyword search for suitable products, such that features are extracted from reviews' text. A product may be ranked, for example, according to the positive score of a desired property. A few works exist in this scope [Duan et al., 2013; Ganesan and Zhai, 2012]. This class of methods has a higher abstraction level, since the results are products presented to the user, who does not have to evaluate reviews or summaries. This solution is similar to a typical product recommendation, but with contextual information represented by the desired set of features instead of user preferences.

3.1.5 Product Recommendation

The most typical recommendation problem in a review platform is recommending the reviewed objects, completely avoiding the need for reading reviews, summaries or any additional information about them. Reviewed objects may be products (the generic name convention), services, locations, news, and many others. Such systems may also take advantage of reviews, for example, by gathering more information about users and products [D'Addio and Garcia Manzato, 2014; McAuley and Leskovec, 2013]. Product and review recommendation may be solved with similar approaches, represented by any abstract recommender system. Collaborative filtering stands out as a widely used solution, performing well in several applications [Koren et al., 2009]. Notably, Matrix Factorization, a model-based collaborative filtering method, had a central role in the winning recommender system of the Netflix competition [Koren et al., 2009]. This method considers a set of latent factors for modeling users and items in the same space.
Then, ratings are given by the inner product of a user's and an item's factors. Nonetheless, the set of reviews is highly dynamic, thus constantly suffering from the cold-start problem. Hybrid techniques combine the advantages of content-based and collaborative filtering. For instance, Regression-based Latent Factor Models (RLFM) is a hybrid technique that uses latent factors, borrowed from collaborative filtering, that are centered on the regression of features, commonly adopted in content-based approaches [Agarwal and Chen, 2009]. Not only item features are used, but also features related to the user and to the user-item interaction. This scheme works very well for cold-starts by predicting through regression on features.

3.2 Recommendation as a Ranking Task

Some recent solutions explore the potential of considering ranking optimization together with collaborative filtering, also known as collaborative ranking. Balakrishnan and Chopra [2012] propose two pointwise and two pairwise models for learning a matrix factorization model for ranking. A pointwise model defines the loss function based on each rating, whereas a pairwise solution computes the loss function on pairs of ratings. The first pointwise approach consists of a two-stage model: (i) matrix factorization learns latent factors that become feature vectors for each user and item; then, (ii) a regression model is built on this input. Another pointwise model is built by a neural network that simultaneously learns latent factors and regression weights. There are similarly two pairwise methods, one two-stage and another with a single stage; the only difference lies in the loss function, which is the cross entropy of the probabilities of an item being ranked above another. The Extended Collaborative Less-is-More Filtering (xCLiMF) [Shi et al., 2013] extends CLiMF [Shi et al., 2012] by allowing non-binary responses.
The Expected Reciprocal Rank (ERR) is maximized, and ratings are represented by the inner product of user and item latent factors, as in Matrix Factorization. ERR considers a probabilistic framework for users' navigation based on a cascade model. xCLiMF presents better performance than CofiRank, which minimizes the normalized Discounted Cumulative Gain (nDCG) [Weimer et al., 2007]. Christakopoulou and Banerjee [2015] provide three versions of algorithms that optimize ranking performance focusing on the error at the top of the list, also modeling predictions using Matrix Factorization. P-Norm Push minimizes the p-norm of the vector composed of the heights of irrelevant items, i.e., the number of relevant ones below them. Infinite Push maximizes the infinite norm of the same vector, corresponding to its maximum value. Reverse-Height Push minimizes the reverse height of relevant items, represented by the number of irrelevant items above them. However, all these solutions use binary responses, not a scale as we consider here. In general, collaborative ranking relies heavily on latent factors by extending collaborative filtering. Consequently, such methods fail to predict for cold-starts and neglect the use of observed information for recommendation purposes.

3.3 Analyses of Reviews

Several works analyze review data by searching for patterns underlying the process of writing and evaluating reviews. Bakhshi et al. [2014] investigate whether exogenous factors (such as weather, demographics and time of the year) may influence ratings. If they do, an undesirable situation occurs, since evaluations are supposed to consider endogenous factors only. Weather, for instance, is highly correlated with ratings, which have lower values in the summer. Moreover, high population density and education levels result in more reviews being produced in the area. Danescu-Niculescu-Mizil et al. [2009] analyze reviews considering rating and helpfulness.
They note that the closer a rating is to the average product rating, the higher the received helpfulness. By dividing the products according to rating variance, three patterns are identified: (i) for low variance, reviews with average rating have higher helpfulness; (ii) for moderate variance, reviews are more useful if the rating is a little above the average; (iii) for high variance, the helpfulness distribution is bimodal, with one mode above and the other below the average, the average being a local minimum. Jurafsky et al. [2014] show that reviews' texts are biased towards positive emotions, as they are above the mean obtained from the Google Books Corpus [Michel et al., 2011]. The authors also observe that reviews with low ratings often present a narrative about a trauma. Besides, reviews of low-cost restaurants are related to food addiction, whereas more expensive ones are associated with a more complex language, composed of more and longer words, with a sensual perspective of food. Interestingly, local contexts represented by other reviews may impact the perceived helpfulness. One would expect to observe a rich-get-richer effect on helpfulness values, in which the higher a review's ranking, the more votes it receives. Nonetheless, users tend to evaluate a review more positively if its current position is below the deserved one, and negatively if above [Sipos et al., 2014]. The voting mechanism converges, i.e., the true helpfulness is reached over time. Besides polarity, voting participation is also analyzed: reviews in higher positions tend to receive more votes, which is indeed a rich-get-richer effect, but so do reviews far from their deserved ranking.

3.4 Concluding Remarks

Most existing solutions for helpfulness prediction consider that users perceive helpfulness in the same way.
A few works [Moghaddam et al., 2012; Tang et al., 2013] show that personalization is more suitable, although they investigate a regression task and their results do not indicate the significance level. Review recommendation approaches are still scarce and there is great space left for improvement. We start to explore such space by assessing a rich set of approaches, from five paradigms, with statistical confidence and for a ranking task. Although item recommendation and keyword search are useful and promote a high level of abstraction for the user, in some cases the user does want to investigate raw reviews. For example, there may be a situation where a consumer is interested in a specific product and wants to obtain more information, or even wants to compare two products in detail. In such a setting, given the great volume of reviews, which cannot be fully read, distinguishing the most useful ones plays an important role. Summarizing reviews is a task that, similarly to review recommendation, allows users to obtain more information about an object of interest and may be used in a complementary manner. The challenges around this solution are even higher, as processing unstructured data (the reviews), selecting the most important information and composing a summary in a cohesive way are much farther from the current state-of-the-art. Other ways of processing review data include keyword search, by extracting features from reviews, and direct product recommendation, possibly using reviews to augment the available information. A few solutions solve a recommendation problem from a ranking perspective, but some of them are restricted to binary relevance and they mostly use only latent features by extending collaborative filtering techniques. In a sparse scenario such as review recommendation, where reviews are being produced every day, relying on recommendations only for warm-start entities becomes impractical.
Analyses of the dynamics underneath review sites show diverse patterns regarding conducts for writing and evaluating reviews. They mainly limit the reach of solutions relying on review data. For example, the influence of exogenous factors and context dependency make ratings and votes not completely reliable. We focus on the specific case of review recommendation and evaluate specialized solutions that were not previously compared against each other, nor with a ranking metric and statistical tests. We provide a much richer comparison of approaches, including five paradigms, than previously adopted for review recommendation [Moghaddam et al., 2012; Tang et al., 2013], and additionally with a ranking goal. We are the first, to the best of our knowledge, to compare pairwise and listwise learning to rank solutions, as well as personalized regression and sophisticated recommender systems, for review recommendation. By performing such an experimental setting, not applied so far, we identify the best solution and paradigm for ranking from a rich set, and elucidate the key aspects for designing an effective solution.

Chapter 4

Helpfulness Voting Dynamics: A Case Study in Ciao UK

In this chapter, we study the helpfulness voting dynamics in a dataset from Ciao UK. We start by describing the platform (Section 4.1) and presenting an overview of the dataset (Section 4.2); then we investigate users' participatory behavior (Section 4.3) and their connections over a trust network (Section 4.4). Then, focusing on the purpose of review recommendation, we compare the universal and individual hypotheses for helpfulness (Section 4.5). Next, we represent votes by a set of features and evaluate their importance for helpfulness prediction (Section 4.6). Finally, we present concluding remarks for all analyses (Section 4.7).

4.1 A Closer Look at Ciao UK

Ciao UK is a platform prevalent in Europe with a diverse set of reviewed items.
There are 27 categories, including Health, Computers, DVDs, Travel and even Ciao Café, covering generic topics, from controversial debates to top 10 listings, for which users may provide their own opinion through a review. Besides writing standard reviews and voting on them, users may also post videos reviewing a product, pose questions to other users, and provide quick reviews with at most 150 words. A key advantage of this platform lies in publicly disclosing users' votes. This is a very rare feature, as this information is not available in the majority of applications, including Amazon, TripAdvisor and Yelp. Indeed, this is the only existing platform (to the best of our knowledge) containing each user's vote, which is required for learning and evaluating a personalized review recommender system.

Figure 4.1: Example of reviews in the list of latest ones in Ciao UK. Label 1 indicates the rating; 2, the review title; 3, the product; 4, the author's login; 5, the review text; and 6, the overall helpfulness.

In most review platforms, users post reviews without a monetary reward. On the other hand, Ciao UK allows earnings for each standard review written. The amount earned depends on the number and values of helpfulness votes received and on two properties of a product: scarcity of available reviews and popularity¹. Figure 4.1 shows a sample from the list of latest reviews (on January 21st, 2016), with the main elements related to a review: product, author, rating, text, and overall helpfulness. Each helpfulness value is associated with a label: 0 means off-topic; 1, not helpful; 2, somewhat helpful; 3, helpful; 4, very helpful; and 5, exceptional. Figure 4.2 shows a review in detail, including the author's basic reputation and the distribution of helpfulness given to the review. Personalized votes are publicly visible, as demonstrated by the login following each vote in the figure.
Users do not always agree on the helpfulness, as shown by the distribution of vote values. Detailed helpfulness votes are presented in Figure 4.3, for the helpful and somewhat helpful levels. Optionally, a user may supply a comment in addition to a vote, typically justifying the helpfulness given. In this case, users complain about the lack of details, missing personal opinion and the format of the review. Not only do users differ in the perceived helpfulness, but also in what aspects they like or dislike in a review. Therefore, a global helpfulness index does not represent this scenario well.

¹Ciao UK: Earning Money. http://www.ciao.co.uk/faq/earning-money,12

Figure 4.2: Fragment of a review page in Ciao UK, in which not all users agree on the review's helpfulness. Label 1 shows the author login; 2, the rating; 3, basic author reputation; 4, the review text; 5, the average helpfulness; and 6, the distribution of votes.

Figure 4.3: Example of a review's comments, which may be supplied with a vote. In this example, users complain about missing information in the review, a recurring reason for lower helpfulness votes.

Overall, Ciao UK is a very unorthodox review site that includes different categories, monetary rewards, public helpfulness votes and specific dynamics. For instance, review payment might bias towards more helpful reviews. Also, publicly disclosed votes may inhibit true opinions, especially when the writer depends on them for monetary gain. Such a scenario also allows a group of users to exchange many positive helpfulness votes, misleading the veracity of helpfulness evaluations. Although such situations may occur, we believe they do not heavily impact the experiments conducted in this work. Nonetheless, we leave such investigation for future work.

Table 4.1: Statistics of the dataset.

                             Original              Filtered
  # Reviews                  301,011               58,826
  # Users                    44,239                8,511
  # Authors                  10,965                8,065
  # Voters                   43,590                1,737
  # Authors & Voters         10,316                1,291
  # Products                 112,837               1,908
  # Helpfulness Votes        8,870,598             457,697
  Mean # Votes by Review     29.46                 8.05
  Reviews' Time Span         05.31.00 - 09.25.11

4.2 Dataset Overview

We use a publicly available dataset crawled from Ciao UK². The crawling was performed through a breadth-first search starting with active users [Tang et al., 2013]. The date of the crawl is not available; all we know is that the last review is from 2011 and the work describing the dataset was published in 2013. To properly use this dataset, we disregard instances with empty fields, invalid rating values, invalid field formats, and texts with more than 40% of words outside the WordNet dictionary³ [Miller, 1995]. Also, since we adopt a ranking perspective for recommendation, we evaluate the performance of any technique using ranking metrics, thus requiring rankings of reasonable sizes. In our problem, each group of votes for the same reader-product pair provides a ranking, which corresponds to the reviews of the product evaluated by the given user. Figure 4.4 shows that most rankings are very small in the dataset considered. Indeed, 41.46% are composed of only one element, and the median value is two. Thus, we filtered the dataset to contain only helpfulness votes associated with rankings of ten elements or more. The final dataset has around 5% of the original number of helpfulness votes, which still covers around half a million votes, a large sample. Table 4.1 presents statistics of the original and filtered datasets.

²Ciao UK Dataset. http://www.jiliang.xyz/Ciao.rar
³WordNet is an English dictionary; thus we focus on English reviews.
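The ranking-size filter described above can be sketched as follows (the tuple layout for vote records is hypothetical, for illustration only):

```python
from collections import defaultdict

def filter_rankings(votes, min_size=10):
    """Keep only votes whose (reader, product) ranking has >= min_size reviews.

    `votes` is a list of (reader, product, review, helpfulness) tuples; each
    (reader, product) group forms one ranking, as in the filtering step above.
    """
    groups = defaultdict(list)
    for reader, product, review, value in votes:
        groups[(reader, product)].append((review, value))
    return {key: g for key, g in groups.items() if len(g) >= min_size}

# Toy data: u1 voted on 10 reviews of p1 (kept); u2 on 1 review of p2 (dropped).
votes = [("u1", "p1", f"r{i}", 4) for i in range(10)] + [("u2", "p2", "r0", 3)]
kept = filter_rankings(votes)
print(sorted(kept))  # [('u1', 'p1')]
```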
Figure 4.4: Distribution of ranking sizes in the original dataset. Sizes 1 through 9 and 10+ correspond to 41.46%, 12.28%, 7.04%, 4.84%, 3.26%, 2.85%, 2.03%, 1.62%, 1.63% and 22.97% of the rankings, respectively.

Chronologically dividing the dataset into training and test sets is an important procedure, since otherwise future data may be used in the learning process. However, the dataset used does not contain the votes' timestamps for performing such a split. Hence, we use each review's timestamp as an approximation for the timestamps of its votes. Therefore, in a chronological split, votes from the same review lie in the same set (training or test), configuring all reviews as cold-starts. This situation severely penalizes collaborative filtering approaches and requires a workaround, as we further discuss in Section 5.7. Although this situation imposes certain restrictions, no other dataset with individual helpfulness votes by each user was found, nor a platform that makes this information publicly available.

4.3 Participatory Behavior

In order to understand how users interact with the platform and their different profiles, we investigate patterns of activity and popularity. Figure 4.5a shows that many authors have written few reviews, and few authors have written many; the trend repeats for the number of votes received. A similar pattern is observed for the number of votes given by voters (Figure 4.5b) and the number of votes for a review (Figure 4.5c). All distributions are heavy-tailed, pointing to very different user and review profiles in the platform, with the majority of entities having few related reviews or votes [Easley and Kleinberg, 2010].

Figure 4.5: Distributions of activity and popularity, in log-log scale. Figure (a) contains the distribution of the number of reviews written and the number of votes received per author. Figure (b) presents the number of votes given by users, as voters. Figure (c) shows the number of votes received per review.

Figure 4.6: Distributions of evaluation values: ratings as evaluations of products in Figure (a) and votes as evaluations of reviews in Figure (b). Ratings 0 through 5 correspond to 0.02%, 9.42%, 3.46%, 14.84%, 19.36% and 52.90% of product evaluations; helpfulness votes 0 through 5 correspond to 0.06%, 0.26%, 1.84%, 8.78%, 86.96% and 2.09% of review evaluations.

In this context, recommender systems are relevant as, otherwise, the vast amount of unpopular reviews in the long tail⁴ would not be easily discovered [Yin et al., 2012]. In turn, recommending to very different users, whose available information is of discrepant sizes, is a tough task. Thus, personalization would be advantageous, not only to consider different user preferences, but also different participatory profiles. Figure 4.6 presents the distribution of users' evaluations as ratings and votes. Both distributions are very concentrated, with clear peaks around high ratings or votes. The latter peak is even more intense, as over 85% of reviews have a helpfulness vote equal to 4.

⁴Considering a plot with reviews sorted by popularity on the x-axis and popularity on the y-axis.

Figure 4.7: Scatter plots correlating user participation and mean helpfulness: (a) author vs. voter role; (b) number of reviews written vs. mean vote; (c) number of votes received vs. mean vote.
The reason for such concentrated distributions may lie in users being more compelled to evaluate when they like a product or a review. In such a scenario, any predictor outputting values close to 4 has a reasonably low regression error, which is another motivation for prioritizing ranking metrics. Since votes are publicly displayed in Ciao UK, users might feel intimidated about giving low grades. Ratings of 1 are more frequent than ratings of 2, a possible indication that people are also more motivated to vote if they extremely dislike a product than if they moderately dislike it. Nonetheless, this does not occur for helpfulness votes; after all, the cost of buying a bad product is higher than the cost of reading a bad review. Another distinction between ratings and votes lies in the most frequent value: while it is 4 for votes, it is 5 for ratings, suggesting that users easily consider reviews as helpful, but hardly as perfect. We further analyze authors' profiles by correlating their behavior as voter and as author. In such study, we use both the Pearson correlation coefficient ρ and Spearman's rank correlation coefficient ρs. The first measures the strength of a linear correlation between two variables, whereas the second computes the correlation between the rankings of two lists of observations. In turn, the Spearman correlation indicates if the increase or decrease of one variable, by any proportion, is followed by a parallel growth or reduction of the other. Both coefficients occur in the range [−1, 1], where −1 implies total negative correlation; 1, total positive correlation; and 0, no correlation at all. The computation is performed using the following formulas:

\rho_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}   (4.1)

\rho^{s}_{xy} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},   (4.2)
where x and y are pairwise observations, d_i is the difference of the ranking positions of the i-th observation in the x and y rankings, and n is the number of observations. Figure 4.7a shows the scatter plot of the number of reviews written versus the number of votes given by an author, in log-log scale. There is a weak linear correlation, ρ = 0.3751, such that users tend to participate similarly in both scenarios; however, there is a reasonable variance in the strength of each role among users. Indeed, ρs = 0.4455, a moderate value. Such correlation indicates that, if a user participates more in one role, they tend to be more active in the other, but not necessarily in the same proportion. Then, we explore how user activity and popularity correlate with helpfulness. We first consider the correlation of writing activity, represented by the number of reviews created, with the average helpfulness received by each author. The scatter plot of such comparison is in Figure 4.7b. To investigate whether an increase in the number of votes coincides with higher helpfulness, we compute both correlation metrics, obtaining ρ = 0.2300 and ρs = 0.2557. When correlating average helpfulness with user popularity, measured as the number of votes received, we observe a lower value for Pearson and a greater one for Spearman, with ρ = 0.2276 and ρs = 0.3266. The scatter plot of this correlation is shown in Figure 4.7c. Thus, popularity is more important than activity for predicting relative helpfulness, although not linearly, as its peak around 4 is more prominent than for activity. Indeed, popularity is an aggregated form of readers' feedback while activity is not, thus working better for helpfulness prediction. Overall, there are underactive and unpopular users with a wide range of values for average helpfulness, in which case personalization may help distinguish among them.
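Equations 4.1 and 4.2 translate directly into code. A small sketch with made-up activity counts follows (the simple Spearman form below assumes no ties among the observations):

```python
def pearson(x, y):
    """Pearson coefficient (Equation 4.1): strength of linear correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y)) ** 0.5
    return num / den

def spearman(x, y):
    """Spearman coefficient (Equation 4.2): correlation between rank positions."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for pos, i in enumerate(order):
            r[i] = pos + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # squared rank differences
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Toy data: per-user review counts vs. votes given (illustrative values only).
reviews_written = [2, 5, 9, 40, 120]
votes_given     = [1, 8, 4, 30, 200]
print(round(spearman(reviews_written, votes_given), 3))  # 0.9
print(round(pearson(reviews_written, votes_given), 3))
```

Note how the extreme value 200 pulls the Pearson coefficient while Spearman, depending only on rank positions, is insensitive to the magnitude of the jump.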
However, it is clear that highly popular and active users typically have a high average helpfulness, around 4. Although we use only one dataset, it is representative, as it resembles the dynamics of other platforms. For instance, an analysis of a dataset from Epinions5 reveals a heavy-tail distribution of user/review participation and popularity, and a highly unbalanced distribution of helpfulness [Moghaddam et al., 2012].

4.4 User Connections over a Trust Network

The Ciao UK platform has an additional feature that allows users to interact through a special social network representing trust. Such a network is directed; thus, when a user trusts another, the reciprocal is not necessarily true. Table 4.2 contains statistics of the trust network. Indeed, far from resembling a friendship network, whose edges are reciprocal by definition, only around one third of the edges are reciprocal. We observe fair average in-degree and out-degree6, whereas the density is low. Such values are higher for the largest Strongly Connected Component (SCC). The degrees are highly concentrated, as the log-log scale distribution in Figure 4.8 shows. Surprisingly, the concentration is higher for out-degree, such that it is easier for a user to be highly active and trust many others than to be popular and trusted by a substantial number of users. Approximately half of the users are in the largest Strongly Connected Component, whereas the other half gathers in groups of at most seven nodes.

Table 4.2: Statistics of the Trust Network.

                                    Network       Largest SCC
  # Nodes                           18,998        9,629 (50.67%)
  # Edges                           145,528       126,867 (87.18%)
  % Reciprocal Edges                34.21         40.72
  Density                           4.03×10⁻⁴     1.37×10⁻³
  Average In/Out-Degree             7.66          13.18
  Average Shortest Path Length      4.75          4.52
  Average Clustering Coefficient    0.12          0.16
  Diameter                          ∞             15
  # WCCs                            1             1
  # SCCs                            8,928         1
  Max # nodes in remaining SCCs     7             0

5 The link for such dataset is no longer available.
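A few of the Table 4.2-style statistics can be sketched on a toy directed graph; this is a hypothetical illustration, not the actual crawled network (weakly connected components are found with a small union-find; SCCs would need Tarjan's algorithm or a graph library such as networkx):

```python
# Hypothetical trust edges (u trusts v). Compute density, reciprocal-edge
# ratio, and the number of weakly connected components (WCCs).
edges = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 2), (5, 1), (6, 7)]
nodes = {u for e in edges for u in e}
n, m = len(nodes), len(edges)

density = m / (n * (n - 1))                        # directed-graph density
edge_set = set(edges)
reciprocal = sum(1 for u, v in edges if (v, u) in edge_set) / m

# Union-find over the undirected version yields the WCCs.
parent = {u: u for u in nodes}
def find(u):
    while parent[u] != u:
        parent[u] = parent[parent[u]]  # path halving
        u = parent[u]
    return u
for u, v in edges:
    parent[find(u)] = find(v)
n_wccs = len({find(u) for u in nodes})

print(n, m, round(density, 3), round(reciprocal, 3), n_wccs)
```
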
Nonetheless, all of them are in the same Weakly Connected Component (WCC)7. The largest SCC is well connected, with higher density, a higher clustering coefficient, and a diameter of 15. As some users engage heavily with this network, it may be a valuable source of information for recommendation, albeit not available in all cases.

An illustration of the largest SCC of the trust network is shown in Figure 4.9. Some users are highly trusted, represented by deep red and large nodes in the center, and gather many others around them. Using a force-directed layout8 for visualization, two poles also emerge: one more concentrated with high in-degree nodes, and the other with in-degree more evenly distributed across nodes. Finally, there are also some local influencers, i.e., users with several trustors, represented by red nodes away from the center.

Figure 4.8: Distribution of in-degree and out-degree in the network, in log-log scale.

Figure 4.9: Subgraph of the trust network corresponding to the largest SCC.

6 Their averages are the same, since the sum of in-degrees equals the sum of out-degrees.
7 We believe, because of this and because all the authors are in the trust network, that the crawling was performed through links of this network. The other possibility was to crawl reviews of users who voted on reviews of the already crawled users.
8 Nodes simultaneously suffer attraction from connected nodes and repulsion from all nodes.
Figure 4.10: Investigation of author consistency and different entities' conformity. (a) Mean helpfulness in 1st vs. 2nd half; (b) CDF of helpfulness CV.

4.5 Review Helpfulness: Universal or Individual?

So far, we have claimed that review helpfulness is usually seen as a universal index, common to all users, while it should be an individual index, personalized to each user. In this section, we explore evidence of the extent to which helpfulness is explained by a universal viewpoint and whether there is indeed potential for a personalized perspective.

First, we investigate the Author Consistency Hypothesis, which claims that different reviews of the same author tend to have the same quality and was validated in previous work [Lu et al., 2010]. We proceed as follows: the votes received by each user are divided chronologically in half, and then the mean of each half is computed. We want to test whether both values tend to be equal, so we use the Pearson correlation. A high ρ is observed between the first and the second halves, equal to 0.7361 (while ρs = 0.7052 is weaker). This shows that, indeed, an author's votes tend to remain close in helpfulness. A scatter plot of the means of each author in the first and second halves is presented in Figure 4.10a. This also suggests that, when readers assess a review, they are mainly assessing the underlying author.

We further analyze the conformity of votes given to an author. Such conformity represents the degree of agreement among different readers on the helpfulness of an author. Thus, high conformity evidences that the author explains well the helpfulness value. In this analysis, we use the Coefficient of Variation (CV) [Jain, 1991], defined as the percentage of the standard deviation (σ) relative to the mean (µ):

CV = 100 \frac{\sigma}{\mu}   (4.3)
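The consistency check above (splitting each author's votes chronologically in half and comparing the means) can be sketched as follows, with hypothetical vote lists:

```python
# Split each author's chronologically ordered votes in half and return the
# mean of each half; the Pearson correlation between first- and second-half
# means across all authors (Eq. 4.1) then quantifies author consistency.
def half_means(votes):
    mid = len(votes) // 2
    first, second = votes[:mid], votes[mid:]
    return sum(first) / len(first), sum(second) / len(second)

author_votes = {"alice": [4, 5, 4, 4], "bob": [2, 3, 1, 2, 3, 2]}  # hypothetical
pairs = [half_means(v) for v in author_votes.values()]
print(pairs)
```
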
As the CV represents the level of variability, a low value indicates a high conformity. We divide the set of votes according to the corresponding author and compute the CV for each author's group. Besides conformity related to authors, we also evaluate this index related to reviews and author-reader pairs. Note that, when considering author and review groupings, a non-personalized conformity is evaluated. For instance, a high review conformity indicates that a review's intrinsic characteristics, derived from its text, justify a great part of the helpfulness values. On the other hand, when evaluating for an author-reader pair, we consider a personalized conformity, as such groupings are specific to a reader9. Thus, we are assessing how much the interaction between author and reader accounts for the observed helpfulness.

The Cumulative Distribution Function (CDF) for the CV related to authors, reviews and author-reader pairs is presented in Figure 4.10b. Around 20% of author votes, 65% of review votes and 78% of author-reader votes have zero variation. Although authors tend to be consistent, reviews are much more so. Additionally, while a non-personalized solution explains a great part of the helpfulness of a review (as suggested by a low CV of votes belonging to a review), we observe a substantial decrease in variation when considering each reader's opinion for an author.

We have further computed the confidence interval [Jain, 1991] for the mean CV of each entity, with 95% confidence. The CV grouped by author presents an interval for the mean of 12.6628 ± 0.2908; grouped by review, of 5.2900 ± 0.1058; and grouped by author-voter, of 3.9624 ± 0.0651. No interval intersects another, which means that the mean conformity of each entity is significantly different. Thus, the personalized author-reader conformity is significantly higher than the non-personalized review and author conformities, as evidenced by its lower mean CV.
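The conformity analysis can be sketched as follows: group votes by entity, compute the CV (Eq. 4.3) per group, and build a 95% confidence interval for the mean CV. Data and group keys are hypothetical:

```python
import math

def cv(values):
    # Coefficient of Variation (Eq. 4.3): 100 * sigma / mu.
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return 100 * sigma / mu

def mean_ci95(values):
    # Normal-approximation 95% CI for the mean (adequate for large samples).
    n = len(values)
    mu = sum(values) / n
    s = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
    half = 1.96 * s / math.sqrt(n)
    return mu, half

votes_by_author = {"a1": [4, 4, 5, 3], "a2": [5, 5, 5], "a3": [2, 4, 3, 5]}
cvs = [cv(v) for v in votes_by_author.values()]
mu, half = mean_ci95(cvs)
print(f"mean CV = {mu:.2f} +/- {half:.2f}")
```
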
Therefore, the author-reader pair, a personalized setting, explains helpfulness votes better than the review and the author, both non-personalized entities. This strongly motivates the pursuit of personalized methods for review helpfulness prediction as the next frontier for improvement.

Finally, Figure 4.11 correlates the CV of helpfulness with the number of votes. If higher CVs were correlated with higher numbers of votes, then author groupings might be penalized, since they tend to be larger. However, we observe the opposite trend: groupings with more votes typically have lower CVs. Following this trend, author-voter pairs should have higher CVs, since they usually have fewer votes. However, that is not the pattern observed: author-voter pairs have lower CVs despite having fewer votes. Thus, this situation is likely explained by factors other than the number of votes, probably the interaction between author and voter.

9 It is not possible to assess review-reader conformity, as each reader can give only one vote per review in most platforms, including Ciao UK.
Figure 4.11: Scatter plot of CV versus number of votes for each grouping: (a) author; (b) review; (c) author-voter.

4.6 Characterization through Features

Some techniques considered in this work use features derived from the dataset for prediction. We use the set of features defined for CAP, a recently proposed review recommendation approach [Tang et al., 2013]. Table 4.3 briefly describes all features. Review features are derived from the review text, containing metrics such as length statistics, part-of-speech tag ratios, and sentiment statistics. Author features capture the author's reputation and past behavior. These first two sets contain non-personalized information, i.e., they are the same for all readers, and were defined in an earlier work [Lu et al., 2010]. Reader features include historical data of the target user of the recommendation. Author-reader similarity features capture the level of similarity between the author and the reader of a review.
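A few of the review-text features can be sketched as follows; the whitespace tokenizer and period-based sentence splitter are naive stand-ins (part-of-speech and sentiment features would require an NLP library), and the sample text is hypothetical:

```python
# Sketch of some length-based review features in the spirit of Table 4.3.
def review_features(text):
    tokens = text.split()                         # sequences of non-whitespace chars
    sents = [s for s in text.split(".") if s.strip()]
    return {
        "r_num_tokens": len(tokens),
        "r_num_sents": len(sents),
        "r_uni_ratio": len(set(t.lower() for t in tokens)) / len(tokens),
        "r_avg_sent": len(tokens) / len(sents),   # average sentence length in tokens
    }

f = review_features("Great phone. The battery lasts two days. Great value.")
print(f)
```
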
Author-reader connection features measure the connection strength between an author and a reader, following the trend of social recommendation [Machanavajjhala et al., 2011] and the potential of the underlying trust network. Some of these features are not precisely defined in the original works; thus, we make some assumptions, presented in Appendix A.

Table 4.3: Features defined in the literature with a brief description. Each feature is derived from an entity or relationship, indicated by the group headers.

Review
  r_num_tokens    Number of tokens, which are sequences of non-whitespace characters.
  r_num_sents     Number of sentences.
  r_uni_ratio     Unique words ratio.
  r_avg_sent      Average sentence length.
  r_cap_sent      Number of capitalized sentences.
  r_noun_ratio    Ratio of tokens classified as nouns.
  r_adj_ratio     Ratio of tokens classified as adjectives.
  r_comp_ratio    Ratio of tokens classified as comparatives.
  r_verb_ratio    Ratio of tokens classified as verbs.
  r_adv_ratio     Ratio of tokens classified as adverbs.
  r_fw_ratio      Ratio of tokens classified as foreign words.
  r_sym_ratio     Ratio of tokens classified as symbols.
  r_num_ratio     Ratio of tokens classified as numbers.
  r_punct_ratio   Ratio of tokens classified as punctuation.
  r_kl            KL divergence of the review text relative to all reviews of the same product.
  r_pos_ratio     Ratio of sentences tagged as positive.
  r_neg_ratio     Ratio of sentences tagged as negative.

Author
  a_num_reviews   Number of reviews written by the author.
  a_avg_rating    Average rating given by the author.
  a_num_trustors  Number of users who trust the author.
  a_num_trustees  Number of users trusted by the author.
  a_pagerank      Author's PageRank score in the trust network.

Reader
  v_avg_rat       Average rating given by the voter.
  v_avg_rat_soc   Average rating of the social network.
  v_avg_rat_sim   Average rating of similar users.
  v_avg_help      Average vote given by the voter.
  v_avg_help_tru  Average vote given by the voter's trust network.
  v_avg_help_sim  Average vote given by similar users.
  v_num_trustors  Number of users who trust the voter.
  v_num_trustees  Number of users trusted by the voter.
  v_pagerank      Voter's PageRank score in the trust network.

Author-Reader Similarity
  s_comm_rated    Number of products rated in common.
  s_jacc_rated    Jaccard coefficient of ratings.
  s_cos_ratings   Cosine similarity of ratings.
  s_pear_ratings  Pearson correlation of ratings.
  s_diff_avg_rat  Difference of author and voter average ratings.
  s_diff_max_rat  Difference of author and voter maximum ratings.
  s_diff_min_rat  Difference of author and voter minimum ratings.

Author-Reader Connection
  c_jacc_trustors Jaccard coefficient of the users' trustors.
  c_jacc_trustees Jaccard coefficient of the users' trustees.
  c_adam_trustors Adamic-Adar score of common trustors.
  c_adam_trustees Adamic-Adar score of common trustees.
  c_katz          Katz index of paths from voter to author.

We evaluate the impact of features, first, by using Information Gain (InfoGain) [Yang and Pedersen, 1997]. This metric computes the change in entropy (i.e., impurity) from splitting the data according to a feature's value. In this calculation, the initial entropy on the whole data is computed and, then, the average entropy after splitting the data according to the value of a feature (for example, one group where feature A equals 1, another where it equals 2, and so on). The reduction in entropy indicates how well a feature in isolation is able to distinguish among all helpfulness values.

Table 4.4: Top 10 features with higher InfoGain.

  Position  Feature         InfoGain
  1         a_pagerank      0.2189
  2         r_num_tokens    0.1973
  3         r_uni_ratio     0.1359
  4         r_num_sents     0.1252
  5         a_num_trustees  0.1145
  6         v_avg_help      0.1128
  7         v_pagerank      0.0922
  8         v_avg_rat_soc   0.0922
  9         v_avg_rat_sim   0.0911
  10        v_avg_help_sim  0.0898

Table 4.4 presents the top 10 features regarding information gain. The first five features are all non-personalized. In fact, a great part of review helpfulness is explained by a global quality derived from review and author. However, the remaining features are all related to the voter. Thus, the same way recommender systems try to improve upon popularity prediction, incorporating personalization into solutions can surpass non-personalized ones.

Table 4.5: Top 10 features with higher absolute value of ρs. Features also in the InfoGain top 10 are marked with *.

  Position  Feature           ρs
  1         r_num_tokens*     0.4099
  2         r_num_sents*      0.3500
  3         r_uni_ratio*      −0.3302
  4         v_avg_help*       0.3134
  5         a_num_trustees*   0.3115
  6         a_num_trustors    0.2798
  7         c_jacc_trustees   0.2380
  8         c_adam_trustees   0.2200
  9         c_adam_trustors   0.2164
  10        a_num_reviews     0.2018

We also assess the relative importance of different features by their correlation ρs with helpfulness, defined in Section 4.3. This metric is useful since we consider a top-n recommendation viewpoint. Table 4.5 shows that half of the features in the ρs top 10 are also in the top 10 of InfoGain, and only non-personalized features appear in the top 3. In general, no evaluated feature alone fully explains the helpfulness value. However, we observe that connection features share some relevance using this metric.

Figure 4.12: Distribution of the Jaccard coefficient of trustees (a), of authors' number of trustors (b), and of authors' number of trustees (c).

In fact, 30%
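The InfoGain computation described above (entropy of the labels before splitting minus the weighted average entropy after splitting by a feature's value) can be sketched as follows, on a hypothetical toy feature:

```python
# Information Gain: base entropy minus weighted post-split entropy.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    base = entropy(labels)
    groups = {}
    for f, y in zip(feature_values, labels):
        groups.setdefault(f, []).append(y)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - after

# Hypothetical: a binary feature that perfectly separates two helpfulness levels.
feature = [0, 0, 1, 1]
helpfulness = [3, 3, 5, 5]
print(info_gain(feature, helpfulness))  # 1.0: the split removes all entropy
```
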
of the time there is no overlap of trustees at all (Figure 4.12a); hence, c_jacc_trustees has considerable importance given that it is unavailable in many cases. Especially for ranking, connection features play a more prominent role. The same importance is not observed for similarity features. Indeed, they are implicitly derived from user behavior, while connection features are based on an explicit expression of trust by users. Only r_uni_ratio is negatively correlated, probably because many unique words indicate a less readable text or one with many typos. Surprisingly, the number of authors' trustees (their out-degree) is more correlated than the number of trustors (their in-degree). Figures 4.12c and 4.12b show that the number of trustees varies more widely and might distinguish authors better. As already indicated by the trust network analysis, users tend to have higher out-degree than in-degree. In this scenario, high participation, by trusting many people, is more positively correlated with helpfulness than popularity, from having many trustors.

We visually present the correlation with helpfulness for features that are in both the InfoGain and ρs top 5 (except for r_num_sents, which presents a pattern similar to r_num_tokens). Figure 4.13a shows that, in general, the higher the number of tokens of a review, the higher its helpfulness vote, as it is more likely to be a thorough description of the product. However, helpfulness votes equal to 1 or less do not follow this pattern. Thus, extreme rejection is not strongly correlated with the size of the review. Figure 4.13b shows that helpfulness increases as the ratio of unique words decreases, but this trend is not observed for helpfulness equal to 1 or less. Finally, Figure 4.13c shows that the number of trustees typically increases with helpfulness; for helpfulness levels between 0 and 2, however, it has practically the same distribution.
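Two of the connection features from Table 4.3 can be sketched as follows, with hypothetical trustee sets; the Adamic-Adar score down-weights common neighbors that have a high degree in the trust network:

```python
import math

def jaccard(a, b):
    # c_jacc_trustees-style overlap: |A ∩ B| / |A ∪ B|.
    return len(a & b) / len(a | b) if a | b else 0.0

def adamic_adar(a, b, degree):
    # c_adam_trustees-style score; degree maps each common neighbor to its
    # number of neighbors in the trust network (hypothetical values below).
    return sum(1 / math.log(degree[z]) for z in a & b)

trustees_author = {"u1", "u2", "u3"}
trustees_voter = {"u2", "u3", "u4", "u5"}
degree = {"u2": 10, "u3": 100}
print(jaccard(trustees_author, trustees_voter))
print(adamic_adar(trustees_author, trustees_voter, degree))
```
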
Overall, we note that low helpfulness votes seem to follow a different pattern
than medium and high ones, using the set of features specified. Such low evaluations may be related to very private and specific conditions. For example, an offensive phrase or the recall of a bad memory may cause an extreme rejection by a reader. Such particular conditions are a challenge to the review recommendation task. Nonetheless, there may be other, unexplored features that explain low helpfulness cases better.

Figure 4.13: Box plots for features in both the InfoGain and ρs top 5 (except the number of sentences), for each helpfulness level: (a) number of tokens; (b) unique words ratio; (c) authors' # trustees.

4.7 Concluding Remarks

We use a dataset crawled from Ciao UK for our analyses, as no other platform (to the best of our knowledge) currently displays publicly the helpfulness votes that are the target of our prediction. As commonly occurs in many other platforms, this dataset contains very different profiles of users and reviews with respect to participation and popularity. In such a scenario, while recommending for users with very different amounts of associated historical data is hard, the purpose of review recommendation is strengthened by the volume of reviews with few evaluations and by the distinct user profiles. A good recommender system may find, in a personalized way, good reviews in the long tail that a user would otherwise hardly discover.
On a 0 to 5 scale, the high ratio of votes equal to 4 shows that users easily consider reviews helpful. Indeed, the author's collaboration is often viewed positively. Nonetheless, distinguishing helpfulness of 4 from 5 is still important, as well as distinguishing both of them from reviews with helpfulness of 3 or less. Authors tend to be consistent in quality across reviews, as their average helpfulness tends to be stable. The most popular and active authors in the platform are very often helpful, but not vice versa. Predicting the helpfulness among unpopular authors remains a challenge.

Authors and reviews have little variation in vote values across readers; thus, a great part of helpfulness is expressed by review and author quality. However, votes grouped by author and reader have an increased conformity, indicating that the personalized viewpoint implied by each reader explains the helpfulness values better. The trust network underlying the platform is a promising source of information for review recommendation, as some users use it heavily. Indeed, connection features appear in the Spearman correlation's top 10 features. Only non-personalized features are in the top 3 of both the InfoGain and ρs rankings. Thus, a great part of helpfulness votes corresponds to the content's intrinsic quality. Nonetheless, personalized features occur in both feature rankings, evidence that overcoming the current performance boundary may be accomplished by personalized prediction.

Chapter 5

Multiple-Paradigm Techniques

We consider techniques from multiple paradigms in our comparison to assess the performance of review recommender systems under different assumptions. In this chapter, we present each paradigm with its overall characteristics as well as the specifics of each solution.
By assessing very distinct approaches, we obtain a diverse set of methods whose performance on review recommendation has not been analyzed so far. Specifically, the specialized review recommenders included in our comparison have not yet been compared against each other. Additionally, regressors have been applied to this problem only in a non-personalized format, while we use a personalized viewpoint for all methods. Finally, we include learning to rank techniques, and a sophisticated recommender, Regression-based Latent Factor Models, both not previously evaluated for this task. By comparing such a rich set of methods, we can derive the best predictor and identify key aspects that provide better performance for this task.

5.1 Paradigms

We evaluate methods from several paradigms with different levels of generality. Algorithms with well-known competitive performance (e.g., LambdaMART) and widely used simple ones (such as linear regression) are included in the comparison. We denote the group of methods of a given paradigm as a class. We describe some of them using naming conventions from their original scenario; Table 5.1 maps the original names to the specific ones. The following classes are considered:

• Mean-based Predictors (MBP) compute mean statistics from the training set and use them for prediction (Section 5.2);

Table 5.1: Mapping of naming conventions from the original definition of the LTR and GRS classes (generic) to the review recommendation scenario (specific).
        Generic      Specific
  LTR   document     review
        query        reader-product pair
        relevance    vote
  GRS   item         review
        user         reader
        rating       vote

• Regressors (REG) predict a real value, as close as possible to the response, for each set of features representing an instance (Section 5.3);

• Learning to Rank Predictors (LTR) determine a ranking of documents for a query given observed features (Section 5.4);

• General-Purpose Recommender Systems (GRS) predict a ranking of items by utility for each user (Section 5.5);

• Review Recommender Systems (RRS) obtain a ranking of reviews by helpfulness for a reader-product pair (Section 5.6).

These classes are defined with increasing specialization, e.g., a REG method may be applied in place of a GRS technique, but not necessarily the other way around. Finally, to present the methods in a standardized way, Table 5.2 presents their shared symbols and definitions, building on the definitions from Section 2.2. In the following, we present the classes and their techniques.

5.2 Mean-based Predictors

Mean-based Predictors are very simple: they compute mean values in the training set and use them as predictions. When the mean is not available (e.g., the training set does not have the votes of an author when using the author's mean), any method uses the overall mean instead. There are three techniques in this class, as follows.

Overall Mean Vote (OM). This method predicts every helpfulness value as the mean of all votes in the training set. Thereby, the utility function is computed as:

\tau(u_i, r_j) = \frac{1}{|H^0|} \sum_{h \in H^0} y_h   (5.1)

Table 5.2: Symbols used for defining the methods.

  y_h      True value of helpfulness for a vote h ∈ H.
  p_h      Predicted value of helpfulness for a vote h ∈ H, i.e., p_h = \tau(u_i, r_j) with \pi_h(h) = (r_j, u_i).
  f_CLS    Function mapping a user u and a review r to the input expected by algorithms in class CLS. The domain is always U × R, but the codomain is class-specific.
  g        Logistic or sigmoid function, computed as g(x) = 1 / (1 + e^{-x}).
  v^T w    Inner product of vectors v and w of dimension d, defined as v^T w = \sum_{i=1}^{d} v_i w_i.
  M_{·,c}  Column vector composed of the elements of all rows in column c of matrix M.
  M_{r,·}  Row vector composed of the elements of all columns in row r of matrix M.
  M_{m×n}  Specifies that matrix M has m rows and n columns.
  ‖v‖      The Euclidean or L2-norm of a vector v of dimension d, computed as ‖v‖ = \sqrt{\sum_{i=1}^{d} v_i^2}.
  ‖M‖_Fro  The generalization of the Euclidean norm to matrices. Considering M_{m×n}, then ‖M‖_Fro = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} M_{ij}^2}.
  D_e      Number of features of a given entity e ∈ {v, r, a, s, c}, denoting voter (reader), review, author, similarity and connection features, respectively.
  {x_{1...n}}  Set of variables x_i where i ranges from 1 to n, i.e., {x_{1...n}} = {x_1, x_2, ..., x_n}.
  x_e      Vector of features associated with an entity or relation e, which may be: voter v_i ∈ U, review r_j ∈ R, author a_k ∈ U, similarity s_{ik} or connection c_{ik} between voter v_i ∈ U and author a_k ∈ U (or a composition of the previous ones).
  H^0_{v_i}  Set of votes given by voter v_i ∈ V^0, i.e., H^0_{v_i} = {h ∈ H^0 | ∃r ∈ R^0 : \pi_h(h) = (r, v_i)}.
  H^0_{r_j}  Set of votes given to review r_j ∈ R^0, i.e., H^0_{r_j} = {h ∈ H^0 | ∃v ∈ V^0 : \pi_h(h) = (r_j, v)}.
  H^0_{a_k}  Set of votes given to reviews of author a_k ∈ A^0, i.e., H^0_{a_k} = {h ∈ H^0 | ∃r ∈ R^0, v ∈ V^0, p ∈ P^0 : \pi_h(h) = (r, v) ∧ \pi_r(r) = (p, a_k)}.
  H^0_{v_i a_k}  Set of votes given by voter v_i ∈ V^0 to reviews of author a_k ∈ A^0, i.e., H^0_{v_i a_k} = {h ∈ H^0 | ∃r ∈ R^0, p ∈ P^0 : \pi_h(h) = (r, v_i) ∧ \pi_r(r) = (p, a_k)}.

Note that the prediction depends neither on the user u_i nor on the review r_j; indeed, a constant value is predicted for all cases.

Author's Mean Vote (AM). For each author, AM computes the mean of the votes associated with all reviews written by that author in the training set.
Then, any vote associated with review r_j and user u_i is predicted as the mean of the review's author, i.e.:

\tau(u_i, r_j) = \frac{1}{|H^0_{a_k}|} \sum_{h \in H^0_{a_k}} y_h, \quad \exists p \in P^0 : \pi_r(r_j) = (p, a_k)   (5.2)

In short, we restrict the set of votes to the ones given to the author of r_j and use the mean of this set as the prediction.

Voter's Mean Vote (VM). For each voter, VM computes the mean of all votes given by that voter in the training set. Then, prediction is performed as:

\tau(u_i, r_j) = \frac{1}{|H^0_{u_i}|} \sum_{h \in H^0_{u_i}} y_h   (5.3)

In this case, the user u_i restricts the set of votes whose mean is used for prediction. Methods in this class are mainly used for regression, as OM and VM provide equal predictions for reviews in the same ranking. Since rankings are composed by grouping reviews' votes for a reader-product pair, the voter (reader) mean and the overall mean are the same for all reviews in a ranking. AM is the only one that estimates differently for reviews in the same ranking.

5.3 Regressors

Regressors (REG) use a set of features (e.g., number of characters of a review, number of reviews written by the author, average helpfulness given by the reader) to predict an approximation of the review's true helpfulness for a given user. Solutions of this type consider the ordinal nature of the output, as opposed to classifiers. Preliminary experiments show that the latter present worse performance than their regressor counterparts; hence, they are not considered. When used for ranking, REG is called pointwise, due to learning scores for a query-document pair in isolation [Li, 2011]. Composing a review profile through features characterizes this class as a content-based recommender. Moreover, as a common model is used for all users, who share coefficients, REG is also a collaborative filtering approach. Therefore, we apply methods in this class in a hybrid format1.

1 We did not find any reference for such a classification of this class.
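The three mean-based predictors defined above (OM, AM, VM) can be sketched as follows; vote tuples and identifiers are hypothetical:

```python
# Votes are (voter, review, author, value) tuples. AM and VM fall back to the
# overall mean when the grouping key is absent from the training set.
train = [("v1", "r1", "a1", 4), ("v1", "r2", "a2", 2), ("v2", "r1", "a1", 5)]

overall = sum(v for *_, v in train) / len(train)

def mean_by(key_idx):
    groups = {}
    for h in train:
        groups.setdefault(h[key_idx], []).append(h[3])
    return {k: sum(vs) / len(vs) for k, vs in groups.items()}

author_mean = mean_by(2)   # index 2 = author
voter_mean = mean_by(0)    # index 0 = voter

def om(voter, author): return overall                           # Eq. (5.1)
def am(voter, author): return author_mean.get(author, overall)  # Eq. (5.2)
def vm(voter, author): return voter_mean.get(voter, overall)    # Eq. (5.3)

print(om("v2", "a2"), am("v2", "a1"), vm("v1", "a9"))
```
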
Using a regressor for review recommendation requires applying a transformation to the input as f_REG : U × R → R^n, representing a mapping to a vector of n features. The utility function τ is derived with an image close to the range of the original votes, as algorithms in this class aim at absolute predictions. Three representatives of this class are considered in our experiments and described next.

Linear Regression (LR). This predictor learns a linear model to compute a vote given features related to a reader and a review. The predicted vote is a linear combination of these feature values added to a constant, the intercept:

\tau(u_i, r_j) = w^T x_{ij} + b, \quad x_{ij} = f_{REG}(u_i, r_j),   (5.4)

where w is a vector of coefficients and b is the intercept; both are adjusted in the learning process from past transactions. Different loss functions might be used for fitting the coefficients. Typically, the least squares loss is applied with L2 regularization [Ng, 2004], called Ridge Regression [Murphy, 2012], whose objective function is:

E_{LR} = \frac{1}{2} \sum_{h \in H^0} (p_h - y_h)^2 + \frac{\beta}{2} \|w\|^2,   (5.5)

where β is the regularization factor that rules the importance of the regularization term. Regularization is used to avoid overfitting, a phenomenon of overspecialization on the training set such that the fitted model represents not only the true pattern of the data, but also noise [Ng, 2004]. This situation results in bad performance on the test set. Lower coefficients tend to generate simpler models with a smaller performance gap between training and test sets; thus, regularization usually addresses the norm of the coefficients.

Support Vector Regression (SVR). SVR corresponds to the regressor version of the Support Vector Machine classifier (SVM) [Vapnik, 1995; Vasconcelos et al., 2015]. A predictor is built by minimizing both the model complexity (represented by the norm of the vector of coefficients) and the absolute errors of data points outside a margin of tolerance ε.
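The Ridge objective in Eq. (5.5) admits the closed-form solution w = (XᵀX + βI)⁻¹ Xᵀy; a minimal sketch with hypothetical toy data, where the intercept is folded in as a constant feature (strictly, the intercept is usually not regularized, which the tiny β below makes negligible):

```python
import numpy as np

def ridge_fit(X, y, beta):
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    d = X1.shape[1]
    # Solve (X^T X + beta*I) w = X^T y.
    return np.linalg.solve(X1.T @ X1 + beta * np.eye(d), X1.T @ y)

def ridge_predict(w, X):
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    return X1 @ w

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # e.g., a single review feature
y = np.array([2.0, 3.0, 4.0, 5.0])           # helpfulness values (y = x + 1)
w = ridge_fit(X, y, beta=1e-6)
print(ridge_predict(w, np.array([[5.0]])))   # close to 6.0
```
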
As a consequence, the model focuses on minimizing large errors only. The trade-off between model complexity and intolerable errors is determined by the parameter $C$, the penalty factor. The higher the value of $C$, the more errors are penalized and the more complex is the learned function. Representing nonlinear relations between features is possible by applying a kernel transformation. When a linear kernel is used, the predictive function has the same format as LR, as presented in Equation 5.4. However, SVR does not consider errors below the $\varepsilon$ tolerance while optimizing. The optimization goal is defined as follows:

$$\begin{aligned}
\text{minimize} \quad & E_{SVR} = C \sum_{h \in H^0} (\xi_h + \xi^*_h) + \frac{1}{2}\|\mathbf{w}\|^2 \\
\text{subject to} \quad \forall h \in H^0 : \quad & y_h - \mathbf{w}^T\mathbf{x}_{ij} - b \leq \varepsilon + \xi_h \\
& \mathbf{w}^T\mathbf{x}_{ij} + b - y_h \leq \varepsilon + \xi^*_h \\
& \xi_h,\, \xi^*_h \geq 0, \quad \pi_h(h) = (u, r), \quad \mathbf{x}_{ij} = f_{REG}(u, r),
\end{aligned} \tag{5.6}$$

where $\mathbf{w}$ and $b$ are as in LR, $\xi_h$ is the error of a prediction lower than the true value and $\xi^*_h$, of a greater one, both for vote $h$. Like LR, an L2 norm is traditionally applied for regularization; differently, though, errors are taken as absolute deviations beyond $\varepsilon$, not squared ones. The trade-off between error and regularization costs is controlled by $C$ instead of $\beta$, but setting $C = \frac{1}{\beta}$ results in an equivalent definition. Figure 5.1 presents an example of a linear SVR model, representing the errors beyond the tolerance from above or below.

Gradient Boosting Regression Trees (GBRT). A regression algorithm with typically high performance for ranking is GBRT [Lucchese et al., 2015], also known as Multiple Additive Regression Trees (MART) [Friedman, 2001]. GBRT learns a nonlinear function by combining several predictive functions of the same type (i.e., an ensemble) whose weighted sum of returned values provides the final prediction. The model is built stage-wise, extending the current function by adding a new decision tree fitted using the pseudoresponses as ground truth [Burges, 2010].
Pseudoresponses are the negative of the loss function's derivative evaluated at each instance. Therefore, a model update simulates a gradient descent step, which changes the current function towards the negative of the gradient in order to reduce the error. In this case, an approximation of the gradient is computed by fitting a decision tree on the pseudoresponses. The utility function has the following format:

$$\tau(u_i, r_j) = \sum_{t=1}^{T} \gamma_t F_t(\mathbf{x}_{ij}), \quad \mathbf{x}_{ij} = f_{REG}(u_i, r_j), \tag{5.7}$$

where $F_t$ is the decision tree built in the $t$-th iteration by fitting the pseudoresponses, $T$ is the number of decision trees and $\gamma_t$ is the weight given to the result of tree $F_t$. The value of $\gamma_t$ is typically chosen to minimize the loss function when the new tree is added, for instance, using Newton's method [Burges, 2010]. The value predicted after reaching a leaf node is, in most cases, just the average of the training samples falling there.

Figure 5.1: Illustration of a linear SVR model. The solid line represents the predictive model and dashed lines delimit the tolerable deviations ($\varepsilon$, $\xi_h$, $\xi^*_h$). Only instances whose prediction is beyond the tolerance are considered as errors.

5.4 Learning to Rank Predictors

Learning to Rank (LTR) approaches build a model to predict a ranking of documents given a query. Using features of a query-document pair, the model outputs a relevance of the document for the query. Relevance levels are estimated by considering only the relative ordering and may even lie in a completely different but more advantageous range (e.g., less expensive to compute, or with a better predictive function upon the features).

We evaluate two strategies: RankSVM [Chapelle and Keerthi, 2010] and LambdaMART [Burges, 2010]. They build predictive models using different strategies – pairwise and listwise, respectively – that represent a fair coverage of existing approaches. Such strategies differ in how the loss function is computed.
A pairwise approach optimizes a loss defined over pairs of documents, typically as a classification problem of one being ranked above or below the other. In turn, a listwise method tries to directly optimize a ranking loss function computed over all the queries' rankings [Li, 2011]. Similarly to the REG class, LTR techniques are applied with item profiles and common user weights, thus being hybrid recommenders.²

An LTR predictor employs an input transformation of the form $f_{LTR} : U \times R \mapsto Q \times \mathbb{R}^n$, where $Q$ is a set of query identifiers and $n$ is the number of features. The set $Q$ is only used during the training phase for grouping reviews into rankings. Then, using only a feature vector, the predictor is able to output a relevance score. The utility function $\tau$ provides outputs not necessarily in the same range as the votes, since this class aims at position correctness, not absolute predictions.

²Similarly to the REG class, we also did not find any reference for this classification.

RankSVM (RSVM). A representative of the pairwise learning to rank strategy is RSVM, a version of SVM that minimizes the number of inversions between documents in the same query [Chapelle and Keerthi, 2010; Canuto et al., 2013]. Thus, the problem becomes a classification task solved by SVM, such that the input vectors are the differences of the vectors of two instances. The two classes are: positive, when the first vector is ranked above; and negative, when it is below. The goal may be stated as:

$$\begin{aligned}
\text{minimize} \quad & E_{RSVM} = C \sum_{(h,h') \in H^0_P} \xi_{h,h'} + \frac{1}{2}\|\mathbf{w}\|^2 \\
\text{subject to} \quad \forall (h,h') \in H^0_P : \quad & y\,(\mathbf{w}^T(\mathbf{x} - \mathbf{x}')) \geq 1 - \xi_{h,h'}, \quad \xi_{h,h'} \geq 0, \\
& \pi_h(h) = (u, r), \quad \pi_h(h') = (u, r'), \\
& \pi_r(r) = (p, a), \quad \pi_r(r') = (p, a'), \\
& \mathbf{x} = f_{REG}(u, r), \quad \mathbf{x}' = f_{REG}(u, r'), \\
& y = +1 \Leftrightarrow y_h > y_{h'}, \quad y = -1 \Leftrightarrow y_h < y_{h'},
\end{aligned} \tag{5.8}$$

where $H^0_P$ is the set of pairs of votes in the training set, restricted to votes in the same ranking.
In this definition, $h$ and $h'$ are restricted to belong to the same voter $u$, and the corresponding reviews $r$ and $r'$ are associated with the same product $p$. A strict ordering is assumed when assigning the classes; thus, pairs with equal relevance are disregarded.

The hyperplane is defined as $\mathbf{w}^T(\mathbf{x} - \mathbf{x}') = 0$. No error occurs when positive examples are above the upper margin of the hyperplane, $\mathbf{w}^T(\mathbf{x} - \mathbf{x}') \geq 1$, and negative examples are below the lower margin, $\mathbf{w}^T(\mathbf{x} - \mathbf{x}') \leq -1$. An error such that $0 < \xi_{h,h'} \leq \frac{1}{\|\mathbf{w}\|_2}$ occurs for examples within the margin, whereas an error such that $\xi_{h,h'} > \frac{1}{\|\mathbf{w}\|_2}$ is derived for misplaced instances beyond the margin, which lie on the wrong side. The vector $\mathbf{w}$ is normal to the hyperplane built separating the two classes, such that the size of the margin is $\frac{1}{\|\mathbf{w}\|}$. Thus, by minimizing the norm of $\mathbf{w}$, the margin is maximized, making the classes farther apart. If $\mathbf{w}^T(\mathbf{x} - \mathbf{x}') > 0$, then the model predicts that $h$ is higher than $h'$, and $r$ is ranked above $r'$. We can equivalently state that $\mathbf{w}^T\mathbf{x} > \mathbf{w}^T\mathbf{x}'$. Thus, the predicted relevance of each instance equals $\mathbf{w}^T\mathbf{x}$, similar to Linear Regression but without the intercept. A kernel transformation can be used to compose nonlinear separating curves. Figure 5.2 presents a linear model of RSVM, dividing the pairs into two classes and showing the two types of errors.

Figure 5.2: Illustration of a RSVM model. There are two classes, a positive (+1) and a negative (−1), separated by a margin of size $\frac{1}{\|\mathbf{w}\|_2}$, and each pair corresponds to a classified instance. Any pair beyond the correct margin incurs an error $\xi_{h,h'}$ equal to the distance to such margin.

LambdaMART (LMART). LMART is a combination of Gradient Boosting Regression Trees (GBRT) and LambdaRank [Burges, 2010], and it has a central role in the winning technique of the Yahoo! Learning to Rank Challenge [Canuto et al., 2013; Chapelle and Chang, 2011].
It empirically optimizes an arbitrary ranking function (e.g., the normalized Discounted Cumulative Gain at position $p$ – nDCG@$p$). Such direct optimization over rankings characterizes this technique as listwise [Balog and Ramampiaro, 2013]. Similarly to LambdaRank, a document's relevance is updated based on the presence of inversions and proportionally to the score gain after swapping documents; however, it is not a pairwise technique, as such updates are aggregated by rankings. Using these updates, an ensemble is built in a stage-wise fashion, as in GBRT.

Let $d$ and $d'$ be a pair of documents such that $d$ is more relevant, denoted as $d \succ d'$. This pair has an associated lambda variable $\lambda_{d,d'}$ that is proportional to the derivative of the probability of $d \succ d'$ obtained from the current predictive model. Lambda variables are also proportional to the variation of a ranking cost function when the pair is swapped. These variables are summed over all pairs involving a document $d$, such that a term is positive if $d$ is above the other document and negative if it is below. The overall $\lambda_d$ of a document represents a force pushing it up or down the ranking. In LMART, such values are used as pseudoresponses and the trees are built as in GBRT, composing a similar predictive function. As opposed to LambdaRank, which updates all coefficients after evaluating a query, LambdaMART considers all queries jointly by dividing all documents over the tree leaves.

5.5 General-Purpose Recommender Systems

General-Purpose Recommender Systems (GRS) learn a model for suggesting items, preferably in a personalized fashion. Fitting such a model requires a set of previous ratings given by users to items, composing the user-item matrix of ratings. Two representatives are addressed: Matrix Factorization [Koren et al., 2009], a model-based collaborative filtering technique, and RLFM [Agarwal and Chen, 2009], a hybrid method.
GRS require an identity transformation of the input, $f_{GRS} : U \times R \mapsto U \times R$. The utility function $\tau$ is obtained by minimizing the ordering error of the items' ranking for a given user. Although we define this class as a top-$n$ recommender system, the methods described here were originally defined to predict an exact value for an item's utility, which is unnecessary and too restrictive for a ranking task.

Matrix Factorization (MF). This technique models users and items in a $K$-dimensional space of latent factors. The inner product of user and item latent factors estimates the rating given by a user to an item. The meanings of the learned vectors' dimensions are unknown, but they represent properties of items and users that explain the existing ratings. They might correspond, for instance, to category, style and even depth of characters [Koren et al., 2009]. Prediction is performed as:

$$\tau(u_i, r_j) = \mathbf{u}_i^T \mathbf{r}_j, \tag{5.9}$$

where $\mathbf{u}_i$ is the latent vector of user $u_i$, and $\mathbf{r}_j$ is the latent vector of review $r_j$, both with $K$ dimensions. The dot product reflects the similarity between user and item.

MF may be solved by factorizing the matrix of ratings using Singular Value Decomposition (SVD). However, SVD is undefined when the matrix has missing values, so an imputation technique must be applied. This may not be the best solution, as it increases the amount of data and may distort the model [Koren et al., 2009]. Instead, the problem may be stated as a generic optimization and solved using algorithms such as stochastic gradient descent or alternating least squares [Koren et al., 2009]. In this setting, regularization may additionally be introduced, such as an L2 regularization for the latent vectors, weighted by the regularization factor $\beta$:

$$E_{MF} = \frac{1}{2} \sum_{h \in H^0} (p_h - y_h)^2 + \frac{\beta}{2} \Big( \sum_{u_i \in U^0} \|\mathbf{u}_i\|^2 + \sum_{r_j \in R^0} \|\mathbf{r}_j\|^2 \Big) \tag{5.10}$$

The model may be adapted to add biases by jointly adjusting bias variables in the optimization algorithm [Ricci et al., 2010].
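The stochastic gradient descent route for MF can be sketched on a toy rating matrix. The update rule below follows the gradients of the squared error with L2 regularization; the sizes, hyperparameters and data are illustrative assumptions, not the configuration used in the experiments.

```python
import random

# Toy matrix factorization fitted with SGD; hyperparameters are illustrative.
def fit_mf(ratings, n_users, n_items, K=2, alpha=0.01, beta=0.02, iters=1000):
    random.seed(0)
    U = [[random.uniform(-0.1, 0.1) for _ in range(K)] for _ in range(n_users)]
    R = [[random.uniform(-0.1, 0.1) for _ in range(K)] for _ in range(n_items)]
    for _ in range(iters):
        for i, j, y in ratings:
            e = sum(U[i][k] * R[j][k] for k in range(K)) - y   # prediction error
            for k in range(K):
                u, r = U[i][k], R[j][k]
                U[i][k] -= alpha * (e * r + beta * u)  # gradient + L2 penalty
                R[j][k] -= alpha * (e * u + beta * r)
    return U, R

ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0)]  # (user, item, y)
U, R = fit_mf(ratings, n_users=2, n_items=2)
pred = [[sum(U[i][k] * R[j][k] for k in range(2)) for j in range(2)]
        for i in range(2)]
```

After fitting, the dot products recover the observed preferences, including the ordering of items per user, which is what the ranking task actually needs.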
Thus, using a notation similar to that of Section 2.1.4, the prediction becomes:

$$\tau(u_i, r_j) = \mu + b_{u_i} + b_{r_j} + \mathbf{u}_i^T \mathbf{r}_j, \tag{5.11}$$

Table 5.3: Definition of variables used by RLFM.

$K$: Number of dimensions of the latent vectors.
$\alpha_i$: Scalar latent variable related to user $u_i \in V^0$, following a Gaussian distribution centered on the linear regression, with coefficients $\mathbf{g}$, of the user features: $\alpha_i \sim \mathcal{N}(\mathbf{g}^T\mathbf{x}_{u_i}, \sigma_\alpha)$
$\beta_j$: Scalar latent variable related to item $r_j, a_k : r_j \in R^0, a_k \in A^0$, following a Gaussian distribution centered on the linear regression, with coefficients $\mathbf{d}$, of the item features: $\beta_j \sim \mathcal{N}(\mathbf{d}^T\mathbf{x}_{r_j a_k}, \sigma_\beta)$
$\mathbf{u}_i$: Latent vector of length $K$ related to user $u_i \in V^0$, following a multivariate Gaussian distribution centered on $K$ linear regressions, the $k$-th one with coefficients $\mathbf{G}_{\cdot,k}$, of the user features: $\mathbf{u}_i \sim \mathcal{MVN}(\mathbf{G}^T\mathbf{x}_{u_i}, \mathbf{A}_u)$
$\mathbf{v}_j$: Latent vector of length $K$ related to item $r_j, a_k : r_j \in R^0, a_k \in A^0$, following a multivariate Gaussian distribution centered on $K$ linear regressions, the $k$-th one with coefficients $\mathbf{D}_{\cdot,k}$, of the item features: $\mathbf{v}_j \sim \mathcal{MVN}(\mathbf{D}^T\mathbf{x}_{r_j a_k}, \mathbf{A}_v)$
$\mathbf{x}_{ij}$: Vector of interaction features. In our specific case, it is the concatenation of $\mathbf{x}^s_{ij}$ and $\mathbf{x}^c_{ij}$.
$\mathbf{b}$: Linear regression coefficients of $\mathbf{x}_{ij}$.

In short, matrix factorization is a widely used and flexible method. It has received much attention recently due to its fundamental role in the winning technique of the Netflix Prize competition [Koren et al., 2009].

Regression-based Latent Factor Model (RLFM). As a representative of a hybrid technique, RLFM combines the accuracy benefits of collaborative filtering with the ability of content-based approaches to recommend under item cold-start [Agarwal and Chen, 2009]. The rating given by user $u_i$ to item $r_j$ is predicted as follows:

$$\tau(u_i, r_j) = \mathbf{x}_{ij}^T\mathbf{b} + \alpha_i + \beta_j + \mathbf{u}_i^T\mathbf{v}_j, \tag{5.12}$$

where the variables are detailed in Table 5.3.
Since RLFM is not specialized for review recommendation, the author's features are not considered separately; rather, they compose the item features together with the review's. The regression term on the interaction features, $\mathbf{x}_{ij}^T\mathbf{b}$, builds the base prediction value. Variables $\alpha_i$ and $\beta_j$ represent user and item biases, respectively, but tied to features through a linear regression. The inner product $\mathbf{u}_i^T\mathbf{v}_j$ is just like in Matrix Factorization but, in contrast, centered on the regression of the features. Whenever a cold-start is encountered, the regression of the corresponding features replaces the latent variable.

Latent variables and parameters are learned through a Monte Carlo Expectation Maximization (MCEM) procedure [Booth and Hobert, 1999]. An Expectation Maximization (EM) method learns parameters with incomplete data, represented by the latent variables. This situation complicates the direct maximization of the log-likelihood, as there is no closed form solution. EM adjusts parameter values iteratively and guarantees convergence to a local optimum of the log-likelihood [Do and Batzoglou, 2008].

Overall, the fitting process comprises a series of iterations representing refinements on the maximization of the log-likelihood. An iteration is divided into two phases: the E-step computes the expectation of the log-likelihood over the posterior distribution of the latent factors, and the M-step adjusts the parameters by maximizing that expectation. Specifically, latent factors and parameters are initialized before execution. Then, in the E-step, the expectation is not in closed form due to the latent vectors' inner product term $\mathbf{u}_i^T\mathbf{v}_j$, which has an unknown distribution. Gibbs Sampling, a Monte Carlo method, is applied to approximate the unknown distribution of the latent factors conditioned on the observed data and parameters [Murphy, 2012]. Afterwards, the empirical mean of the latent factors provides an approximation of the expectation.
Thus, the EM applied is a special version designated MCEM, due to the presence of a Monte Carlo sampling step. In this scenario, convergence to a local optimum is no longer guaranteed, although the procedure may still converge under regularity conditions, such as a suitable number of iterations and stopping criterion [Booth and Hobert, 1999; Agarwal and Chen, 2009]. In the M-step, the parameters are adjusted through regression on the empirical means, as this maximizes the expectation. Any off-the-shelf regression fitting procedure may be used.

5.6 Review Recommender Systems

Recent works have explored the hypothesis that users do not perceive helpfulness in the same manner; thus, relevant information for review recommendation is incorporated in a personalized fashion. We compare two methods: BETF [Moghaddam et al., 2012], a collaborative filtering technique³, and CAP [Tang et al., 2013], a hybrid one. To the best of our knowledge, no previous work compares representatives of this class against each other. Also, experiments conducted with them provide neither statistical significance nor assessment using a ranking metric.

We present the representatives of this class in detail, as their original papers lack the specifics for a precise implementation. Specifically, Moghaddam et al. [2012] present the algorithm, the derivatives and the loss function for ETF, but not for BETF, which has a slightly different algorithm and loss. Tang et al. [2013] do not include the formulas related to most variables, the log-likelihood, or the algorithm. Therefore, we present a deeper description to fill the gap of definitions for such methods.

³BETF uses reviews' ratings as additional information. However, we do not consider this a content-based trait because the ratings are not used for computing a similarity directly. Instead, BETF jointly fits two collaborative filtering recommender systems: one for ratings and one for helpfulness.
Review Recommender Systems (RRS) apply the identity transformation of the input, $f_{RRS} : U \times R \mapsto U \times R$. The utility function $\tau$ returns a value indicating relative helpfulness, aiming at the lowest error for rankings of reviews. Similarly to GRS, the techniques considered in this class were originally defined to compute absolute values, but they could compute only a relative score instead.

UnBiased Extended Tensor Factorization (BETF). BETF is derived from Matrix Factorization with the following extensions: (i) the review dimension is replaced by two others, product and author, and the method becomes a Tensor Factorization; (ii) a new term is added to the objective function: the sum of squared differences between the ratings and the inner products of author and product latent factors; (iii) voter and review biases are accounted for in helpfulness prediction, and author and product biases in rating prediction [Moghaddam et al., 2012]. Then, the estimated helpfulness of review $r_j$ to user $u_i$ is computed as:

$$\tau(u_i, r_j) = g(\mu_h + \bar{h}_{u_i} + \bar{h}_{r_j} + \mathbf{s} \times_u \mathbf{u}_i \times_v \mathbf{v}_k \times_p \mathbf{p}_l), \quad \pi_r(r_j) = (a_k, p_l), \tag{5.13}$$

where the variables are explained in Table 5.4. Bias variables are static, calculated in a preprocessing step. There are bias variables related to the voter and the review, impacting vote values, and related to the author and the product, impacting rating values (the latter are not used in prediction, only in fitting).

Tensor factorization is performed using the Tucker model, which incorporates a central tensor [Shi et al., 2014]. To identify the cells of the factorized matrices generating each tensor cell, the tensor dimensions are named $u$, $v$ and $p$. Then, the tensor inner product is given by [Moghaddam et al., 2012]:

$$\mathbf{s} \times_u \mathbf{u}_i \times_v \mathbf{v}_k \times_p \mathbf{p}_l = \sum_{x=1}^{K} \sum_{y=1}^{K} \sum_{z=1}^{K} s_{xyz}\, u_{ix}\, v_{ky}\, p_{lz} \tag{5.14}$$

Table 5.4: Definition of variables used by BETF.

$K$: Number of dimensions of the latent vectors.
$\mu_h$: Overall mean of helpfulness in the training set.
$\mu_r$: Overall mean of rating in the training set.
$\mathbf{s}$: Central tensor of shape $K \times K \times K$, representing the dimensions' weights.
$\mathbf{u}_i$: Latent vector of length $K$ related to voter $u_i \in V^0$, following a Gaussian distribution centered on a zero vector: $\mathbf{u}_i \sim \mathcal{N}(\mathbf{0}, \sigma^2_u \mathbf{I})$
$\mathbf{v}_k$: Latent vector of length $K$ related to author $a_k \in A^0$, following a Gaussian distribution centered on a zero vector: $\mathbf{v}_k \sim \mathcal{N}(\mathbf{0}, \sigma^2_v \mathbf{I})$
$\mathbf{p}_l$: Latent vector of length $K$ related to product $p_l \in P^0$, following a Gaussian distribution centered on a zero vector: $\mathbf{p}_l \sim \mathcal{N}(\mathbf{0}, \sigma^2_p \mathbf{I})$
$\bar{h}_{u_i}$: Bias of voter $u_i \in V^0$ for helpfulness, calculated as the average deviation of user $u_i$'s votes from the overall mean.
$\bar{h}_{r_j}$: Bias of review $r_j \in R^0$ for helpfulness, calculated as the average deviation of review $r_j$'s votes from the sum of the overall mean and the corresponding voter bias.
$\bar{r}_{a_k}$: Bias of author $a_k \in A^0$ for rating, calculated as the average deviation of user $a_k$'s ratings from the overall mean.
$\bar{r}_{p_l}$: Bias of product $p_l \in P^0$ for rating, calculated as the average deviation of product $p_l$'s ratings from the sum of the overall mean and the corresponding author bias.

Latent variables are fitted to minimize the error [Moghaddam et al., 2012], which is composed of three terms: (i) the sum of squared errors on vote prediction; (ii) the sum of squared errors on rating prediction; and (iii) the regularization of the latent factors and the central tensor. Note that two models are simultaneously fitted: one for helpfulness and the other for rating. Such error may be represented as⁴:

$$E_{BETF} = \frac{1}{2} \sum_{h \in H^0} (y_h - p_h)^2 + \frac{1}{2} \sum_{r \in R^0} (y_r - p_r)^2 + \frac{\beta}{2} \sum_{u_i \in V^0} \|\mathbf{u}_i\|^2_{Fro} + \frac{\beta}{2} \sum_{a_k \in A^0} \|\mathbf{v}_k\|^2_{Fro} + \frac{\beta}{2} \sum_{p_l \in P^0} \|\mathbf{p}_l\|^2_{Fro} + \frac{\beta}{2} \|\mathbf{s}\|^2_{Fro}, \tag{5.15}$$

⁴The original definition has one regularization factor for each type of latent factor, but all of them were applied with an equal value in the experiments. Therefore, we use a single regularization factor.

Table 5.5: Formulas used by the BETF algorithm.
$$\frac{\partial E_{BETF}}{\partial \mathbf{u}_i} = (p_h - y_h)\, g'(\hat{h})\, (\mathbf{s} \times_v \mathbf{v}_k \times_p \mathbf{p}_l) + \beta \mathbf{u}_i$$

$$\frac{\partial E_{BETF}}{\partial \mathbf{v}_k} = (p_h - y_h)\, g'(\hat{h})\, (\mathbf{s} \times_u \mathbf{u}_i \times_p \mathbf{p}_l) + (p_r - y_r)\, g'(\hat{r})\, \mathbf{p}_l + \beta \mathbf{v}_k$$

$$\frac{\partial E_{BETF}}{\partial \mathbf{p}_l} = (p_h - y_h)\, g'(\hat{h})\, (\mathbf{s} \times_u \mathbf{u}_i \times_v \mathbf{v}_k) + (p_r - y_r)\, g'(\hat{r})\, \mathbf{v}_k + \beta \mathbf{p}_l$$

where:

$$\hat{h} = p_h = \mu_h + \bar{h}_{u_i} + \bar{h}_{r_j} + \mathbf{s} \times_u \mathbf{u}_i \times_v \mathbf{v}_k \times_p \mathbf{p}_l, \qquad \hat{r} = \mu_r + \bar{r}_{a_k} + \bar{r}_{p_l} + \mathbf{v}_k^T \mathbf{p}_l,$$
$$\pi_h(h) = (u_i, r_j), \qquad \pi_r(r_j) = (p_l, a_k),$$

and where the variables $y_r$ and $p_r$ represent the true rating and the predicted rating, respectively. The only purpose of fitting the ratings is to improve the adjustment of the $\mathbf{v}_k$ and $\mathbf{p}_l$ latent vectors by using the additional data from the ratings.

The fitting process uses a Stochastic Gradient Descent (SGD) algorithm to find a local minimum. SGD runs over several iterations and updates the unknown latent variables and coefficients towards the negative of the gradient of the loss function, which is the direction in which the loss decreases the most. The stochastic version considers the loss function for a single example, computing the derivative and adjusting the parameters by observing only one instance. As the complete gradient is not computed, the result is stochastic, not necessarily following the direction of the real gradient. However, by performing updates more frequently, convergence is typically reached much faster. The larger the training set, the more costly it is to compute the complete derivative; thus, SGD is an efficient approach at large scale. The update occurs as follows:

$$\phi_{t+1} = \phi_t - \alpha \frac{\partial E^i}{\partial \phi}(\phi_t), \tag{5.16}$$

where $E^i$ is the cost related to a single example $i$, $\phi$ is any variable being adjusted, $\phi_t$ is the value of the variable after the $t$-th update, and $\alpha$ is the learning rate, which controls the magnitude of the step towards the gradient. Table 5.5 presents the partial derivatives used in BETF, and Appendix B.3 contains their mathematical proofs.

Algorithm 1 represents the BETF procedure. Voters, authors and products are represented as indices in the rating matrix and the helpfulness tensor, received as input.
Although the full stochastic gradients have terms for the helpfulness error, the rating error and the regularization, the original proposal of BETF performs the updates differently. Variables are updated considering, in turn, only the terms related to the helpfulness error for each vote, then the rating error for each review, and finally the regularization for each variable. This way, the helpfulness error is emphasized due to the greater cardinality of votes.

Algorithm 1: Algorithm of the BETF Review Recommender System
Data: integer: $K$, max_iter; real: tol, $\alpha_0$, $\beta$; tensors: $\mathbf{H}^{|V^0| \times |A^0| \times |P^0|}$, $\mathbf{R}^{|A^0| \times |P^0|}$
Result: utility function $\tau$

1. initialize $\mathbf{u}^{K\times1}_{1\ldots|V^0|}$, $\mathbf{v}^{K\times1}_{1\ldots|A^0|}$, $\mathbf{p}^{K\times1}_{1\ldots|P^0|}$, $\mathbf{s}^{K\times K\times K}$ with small random values
2. scale $H_{ikl}$ and $R_{kl}$ into $[0, 1]$, $\forall\, i, k, l$
3. $\mu_h \leftarrow \mathrm{mean}(\{H_{ikl} \mid i = 1 \ldots |V^0|,\, k = 1 \ldots |A^0|,\, l = 1 \ldots |P^0|,\, H_{ikl} \neq\, ?\})$
4. $\mu_r \leftarrow \mathrm{mean}(\{R_{kl} \mid k = 1 \ldots |A^0|,\, l = 1 \ldots |P^0|,\, R_{kl} \neq\, ?\})$
5. for $i = 1$ to $|V^0|$ do: $\bar{h}_{u_i} \leftarrow \frac{1}{|\{k,l \,\mid\, H_{ikl} \neq ?\}|} \sum_{k,l : H_{ikl} \neq ?} (H_{ikl} - \mu_h)$
6. for $k = 1$ to $|A^0|$, $l = 1$ to $|P^0|$ do: $\bar{h}_{r_{kl}} \leftarrow \frac{1}{|\{i \,\mid\, H_{ikl} \neq ?\}|} \sum_{i : H_{ikl} \neq ?} (H_{ikl} - \mu_h - \bar{h}_{u_i})$ †
7. for $k = 1$ to $|A^0|$ do: $\bar{r}_{a_k} \leftarrow \frac{1}{|\{l \,\mid\, R_{kl} \neq ?\}|} \sum_{l : R_{kl} \neq ?} (R_{kl} - \mu_r)$
8. for $l = 1$ to $|P^0|$ do: $\bar{r}_{p_l} \leftarrow \frac{1}{|\{k \,\mid\, R_{kl} \neq ?\}|} \sum_{k : R_{kl} \neq ?} (R_{kl} - \mu_r - \bar{r}_{a_k})$
9. $t \leftarrow 1$; $\alpha \leftarrow \alpha_0$
10. for $it = 1 \ldots$ max_iter do
11. &nbsp;&nbsp; $\alpha \leftarrow \frac{\alpha}{\sqrt{t}}$; $t \leftarrow t + 1$
12. &nbsp;&nbsp; foreach $(i, k, l)$ such that $H_{ikl} \neq\, ?$ do
13. &nbsp;&nbsp;&nbsp;&nbsp; $\hat{h}_{ikl} \leftarrow \mu_h + \bar{h}_{u_i} + \bar{h}_{r_{kl}} + \mathbf{s} \times_u \mathbf{u}_i \times_v \mathbf{v}_k \times_p \mathbf{p}_l$
14. &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{u}'_i \leftarrow \mathbf{u}_i - \alpha\, g'(\hat{h}_{ikl})(g(\hat{h}_{ikl}) - H_{ikl})(\mathbf{s} \times_v \mathbf{v}_k \times_p \mathbf{p}_l)$
15. &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{v}'_k \leftarrow \mathbf{v}_k - \alpha\, g'(\hat{h}_{ikl})(g(\hat{h}_{ikl}) - H_{ikl})(\mathbf{s} \times_u \mathbf{u}_i \times_p \mathbf{p}_l)$
16. &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{p}'_l \leftarrow \mathbf{p}_l - \alpha\, g'(\hat{h}_{ikl})(g(\hat{h}_{ikl}) - H_{ikl})(\mathbf{s} \times_u \mathbf{u}_i \times_v \mathbf{v}_k)$
17. &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{s}' \leftarrow \mathbf{s} - \alpha\, g'(\hat{h}_{ikl})(g(\hat{h}_{ikl}) - H_{ikl})(\mathbf{u}_i \otimes \mathbf{v}_k \otimes \mathbf{p}_l)$
18. &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{u}_i \leftarrow \mathbf{u}'_i$; $\mathbf{v}_k \leftarrow \mathbf{v}'_k$; $\mathbf{p}_l \leftarrow \mathbf{p}'_l$; $\mathbf{s} \leftarrow \mathbf{s}'$
19. &nbsp;&nbsp; end
20. &nbsp;&nbsp; foreach $(k, l)$ such that $R_{kl} \neq\, ?$ do
21. &nbsp;&nbsp;&nbsp;&nbsp; $\hat{r}_{kl} \leftarrow \mu_r + \bar{r}_{a_k} + \bar{r}_{p_l} + \mathbf{v}_k^T \mathbf{p}_l$
22. &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{v}'_k \leftarrow \mathbf{v}_k - \alpha\, g'(\hat{r}_{kl})(g(\hat{r}_{kl}) - R_{kl})\, \mathbf{p}_l$
23. &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{p}'_l \leftarrow \mathbf{p}_l - \alpha\, g'(\hat{r}_{kl})(g(\hat{r}_{kl}) - R_{kl})\, \mathbf{v}_k$
24. &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{v}_k \leftarrow \mathbf{v}'_k$; $\mathbf{p}_l \leftarrow \mathbf{p}'_l$
25. &nbsp;&nbsp; end
26. &nbsp;&nbsp; for $i = 1$ to $|V^0|$ do: $\mathbf{u}_i \leftarrow \mathbf{u}_i - \alpha\beta\, \mathbf{u}_i$
27. &nbsp;&nbsp; for $k = 1$ to $|A^0|$ do: $\mathbf{v}_k \leftarrow \mathbf{v}_k - \alpha\beta\, \mathbf{v}_k$
28. &nbsp;&nbsp; for $l = 1$ to $|P^0|$ do: $\mathbf{p}_l \leftarrow \mathbf{p}_l - \alpha\beta\, \mathbf{p}_l$
29. &nbsp;&nbsp; if previous_loss − current_loss < tol then
30. &nbsp;&nbsp;&nbsp;&nbsp; break
31. &nbsp;&nbsp; end
32. end
33. return $\tau(u_i, r_{kl}) = 5 \times g(\mu_h + \bar{h}_{u_i} + \bar{h}_{r_{kl}} + \mathbf{s} \times_u \mathbf{u}_i \times_v \mathbf{v}_k \times_p \mathbf{p}_l)$

†We index reviews by the author and product indices $k$ and $l$, since we do not manipulate reviews explicitly. However, if we use the function $\pi_r(r_j) = (a_k, p_l)$ to link the entities, then $r_{kl} = r_j$.

The complexity of the algorithm is $O(\text{max\_iter} \cdot K^3 \cdot |H^0|)$. In practice, we do not iterate over the whole tensor $\mathbf{H}$ checking for defined values; instead, we iterate over the training set $H^0$. The main cost is related to fitting the votes, since $|H^0| \geq |R^0|$, as each review is related to one or more votes. Similarly, $|H^0| \geq |V^0|, |A^0|, |P^0|$; hence this cost also surpasses the initialization and regularization overheads. We note that the complexity is linear in the number of iterations and in the size of the training set (in number of votes), but cubic in the number of dimensions of the latent factors, $K$. Thus, increasing $K$ severely penalizes the computational cost.

Table 5.6: Definition of variables used by CAP.

$K$: Number of dimensions of the latent vectors.
$h_{ij}$: Helpfulness vote given by voter $u_i \in V^0$ to review $r_j \in R^0$.
Its true value is denoted by $y_{h_{ij}}$ and its predicted value by $p_{h_{ij}} = \tau(u_i, r_j)$, related as: $y_{h_{ij}} \sim \mathcal{N}(p_{h_{ij}}, \sigma^2_h)$
$\alpha_i$: Scalar latent variable related to user $u_i \in V^0$, following a Gaussian distribution centered on the linear regression, with coefficients $\mathbf{g}$, of the voter features, i.e.: $\alpha_i \sim \mathcal{N}(\mathbf{g}^T\mathbf{x}_{u_i}, \sigma^2_\alpha)$
$\beta_j$: Scalar latent variable related to review $r_j \in R^0$, following a Gaussian distribution centered on the linear regression, with coefficients $\mathbf{d}$, of the review features, i.e.: $\beta_j \sim \mathcal{N}(\mathbf{d}^T\mathbf{x}_{r_j}, \sigma^2_\beta)$
$\xi_k$: Scalar latent variable related to user $a_k \in A^0$, following a Gaussian distribution centered on the linear regression, with coefficients $\mathbf{b}$, of the author features, i.e.: $\xi_k \sim \mathcal{N}(\mathbf{b}^T\mathbf{x}_{a_k}, \sigma^2_\xi)$
$\mathbf{u}_i$: Latent vector of length $K$ related to user $u_i \in V^0$, following a multivariate Gaussian distribution centered on $K$ linear regressions, the $k$-th with coefficients $\mathbf{W}_{\cdot,k}$, of the voter features, i.e.: $\mathbf{u}_i \sim \mathcal{MVN}(\mathbf{W}^T\mathbf{x}_{u_i}, \mathbf{A}_u)$
$\mathbf{v}_j$: Latent vector of length $K$ related to review $r_j \in R^0$, following a multivariate Gaussian distribution centered on $K$ linear regressions, the $k$-th with coefficients $\mathbf{V}_{\cdot,k}$, of the review features, i.e.: $\mathbf{v}_j \sim \mathcal{MVN}(\mathbf{V}^T\mathbf{x}_{r_j}, \mathbf{A}_v)$
$\delta_1(i, k)$: Indicator function, which is 1 if user $i$ is similar (by the cosine of ratings) to user $k$ above the average similarity of $i$ with all the other users, and 0 otherwise.
$\delta_2(i, k)$: Indicator function, which is 1 if user $i$ trusts user $k$ in the trust network of the platform, and 0 otherwise.
$\gamma^k_i$: Scalar latent variable related to voter $u_i \in V^0$ and author $a_k \in A^0$, following a Gaussian distribution centered on the sigmoid function applied to the linear regression, with coefficients $\mathbf{r}$, of the author-voter similarity features, i.e.: $\gamma^k_i \sim \mathcal{N}(g(\mathbf{r}^T\mathbf{x}^s_{ik}), \sigma^2_\gamma)$
$\lambda^k_i$: Scalar latent variable related to voter $u_i \in V^0$ and author $a_k \in A^0$, following a Gaussian distribution centered on the sigmoid function applied to the linear regression, with coefficients $\mathbf{h}$, of the author-voter connection features, i.e.: $\lambda^k_i \sim \mathcal{N}(g(\mathbf{h}^T\mathbf{x}^c_{ik}), \sigma^2_\lambda)$

Context-aware Review Helpfulness Rating Prediction (CAP).
By incorporating content and social contexts into the recommendation model, CAP is a hybrid technique relying on both latent and observed features. Four user actions in the platform are addressed: connecting to others through the trust network, rating items, writing reviews, and evaluating helpfulness. Based on each action, a context is derived from a vote transaction that helps the prediction.

Table 5.7: Definition of sets used by CAP.

$S^0_i$: Set of similar users of $u_i \in V^0$, i.e.: $S^0_i = \{a_k \in A^0 \mid \delta_1(i, k) = 1\}$
$C^0_i$: Set of trustees of $u_i \in V^0$, i.e.: $C^0_i = \{a_k \in A^0 \mid \delta_2(i, k) = 1\}$
$S^0$: Set of all strong similarity pairs, defined as: $S^0 = \{(u_i, a_k) \in V^0 \times A^0 \mid a_k \in S^0_i\}$
$C^0$: Set of all strong connection pairs, defined as: $C^0 = \{(u_i, a_k) \in V^0 \times A^0 \mid a_k \in C^0_i\}$

Overall, one content context and four social contexts are explored. The content context refers to the review text. The social contexts are related to: the author context, derived from the author's reputation for writing reviews; the rater context, obtained from past vote transactions; the connection context, defined by the social network interaction of author and voter; and the similarity context, derived from the rating similarity of author and voter. Then, CAP extends Regression-based Latent Factor Models (RLFM) to review recommendation by incorporating each such context [Tang et al., 2013]. The vote of user $u_i$ to review $r_j$ is predicted as:

$$\tau(u_i, r_j) = \mathbf{u}_i^T\mathbf{v}_j + \alpha_i + \beta_j + \xi_k + \delta_1(i, k)\gamma^k_i + \delta_2(i, k)\lambda^k_i, \tag{5.17}$$

whose variables are defined in Table 5.6. This prediction differs from RLFM's in four ways: (i) author and review variables are considered separately, in $\xi_k$ and $\beta_j$; (ii) author features are not used for computing the mean of $\mathbf{v}_j$; (iii) the dyadic features $\mathbf{x}_{ij}$ are split into two linear regressions; (iv) the dyadic regressions are transformed by a sigmoid function and wrapped into the latent variables $\gamma^k_i$ and $\lambda^k_i$, which are only added under certain conditions through the indicator variables.
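Despite the number of symbols, Equation 5.17 is a plain sum of scalar terms plus one inner product, as the direct transcription below shows; all numeric values here are made up purely for illustration.

```python
# Direct transcription of CAP's prediction (Equation 5.17).
# u_i, v_j: latent vectors; alpha_i, beta_j, xi_k: voter/review/author biases;
# gamma_ki, lambda_ki: similarity and connection terms gated by the indicators.
def cap_predict(u_i, v_j, alpha_i, beta_j, xi_k,
                delta_sim, gamma_ki, delta_trust, lambda_ki):
    dot = sum(a * b for a, b in zip(u_i, v_j))
    return (dot + alpha_i + beta_j + xi_k
            + delta_sim * gamma_ki + delta_trust * lambda_ki)

# Toy case: the voter is strongly similar to the author (delta_sim = 1)
# but does not trust them (delta_trust = 0), so lambda_ki is gated out.
vote = cap_predict([0.5, 1.0], [1.0, 0.2], alpha_i=0.3, beta_j=-0.1,
                   xi_k=0.2, delta_sim=1, gamma_ki=0.4,
                   delta_trust=0, lambda_ki=0.9)
```

The gating by $\delta_1$ and $\delta_2$ is what keeps pairs outside $S^0$ and $C^0$ out of the model, as discussed next for the log-likelihood.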
CAP applies MCEM to fit the latent variables and parameters like RLFM, but differs in how the parameters are adjusted in the M-step, as specific regression procedures are used. Ordinary Least Squares (OLS) [Murphy, 2012] fits the regression parameters related to all latent variables except $\gamma^k_i$ and $\lambda^k_i$, which are solved by Newton-Raphson [Ypma, 1995] due to the sigmoid transformation applied to the regression.

The goal of an EM algorithm is to maximize the log-likelihood. In the specific case of the CAP algorithm, the log-likelihood is given by:

$$\begin{aligned}
\ell_{\Omega,\Theta} = C
&- \frac{1}{2} \sum_{(u_i, r_j) \in V^0 \times R^0} \Big( \frac{1}{\sigma^2_h} \big(y_{h_{ij}} - \mathbf{u}_i^T\mathbf{v}_j - \alpha_i - \beta_j - \xi_k - \delta_1(i,k)\gamma^k_i - \delta_2(i,k)\lambda^k_i\big)^2 + \log \sigma^2_h \Big) \\
&- \frac{1}{2} \sum_{u_i \in V^0} \Big( \frac{(\alpha_i - \mathbf{g}^T\mathbf{x}_{u_i})^2}{\sigma^2_\alpha} + \log \sigma^2_\alpha + (\mathbf{u}_i - \mathbf{W}^T\mathbf{x}_{u_i})^T \mathbf{A}_u^{-1} (\mathbf{u}_i - \mathbf{W}^T\mathbf{x}_{u_i}) + \log(\det \mathbf{A}_u) \Big) \\
&- \frac{1}{2} \sum_{r_j \in R^0} \Big( \frac{(\beta_j - \mathbf{d}^T\mathbf{x}_{r_j})^2}{\sigma^2_\beta} + \log \sigma^2_\beta + (\mathbf{v}_j - \mathbf{V}^T\mathbf{x}_{r_j})^T \mathbf{A}_v^{-1} (\mathbf{v}_j - \mathbf{V}^T\mathbf{x}_{r_j}) + \log(\det \mathbf{A}_v) \Big) \\
&- \frac{1}{2} \sum_{a_k \in A^0} \Big( \log \sigma^2_\xi + \frac{(\xi_k - \mathbf{b}^T\mathbf{x}_{a_k})^2}{\sigma^2_\xi} \Big)
- \frac{1}{2} \sum_{(u_i, a_k) \in S^0} \Big( \log \sigma^2_\gamma + \frac{(\gamma^k_i - g(\mathbf{r}^T\mathbf{x}^s_{ik}))^2}{\sigma^2_\gamma} \Big) \\
&- \frac{1}{2} \sum_{(u_i, a_k) \in C^0} \Big( \log \sigma^2_\lambda + \frac{(\lambda^k_i - g(\mathbf{h}^T\mathbf{x}^c_{ik}))^2}{\sigma^2_\lambda} \Big),
\end{aligned} \tag{5.18}$$

where the derivation of this expression is presented in Appendix B.1, the sets $C^0$ and $S^0$ are defined in Table 5.7, and $C$ is a constant that depends neither on the latent variables $\Omega$ nor on the parameters $\Theta$, which are defined as:

$$\Omega = \{\alpha_{1\ldots|V^0|}\} \cup \{\beta_{1\ldots|R^0|}\} \cup \{\xi_{1\ldots|A^0|}\} \cup \{\gamma^{1\ldots|S^0_i|}_{1\ldots|V^0|}\} \cup \{\lambda^{1\ldots|C^0_i|}_{1\ldots|V^0|}\} \cup \{\mathbf{u}^{K\times1}_{1\ldots|V^0|}\} \cup \{\mathbf{v}^{K\times1}_{1\ldots|R^0|}\} \tag{5.19}$$

$$\Theta = \{\mathbf{g}^{D_v\times1}, \mathbf{d}^{D_r\times1}, \mathbf{b}^{D_a\times1}, \mathbf{r}^{D_s\times1}, \mathbf{h}^{D_c\times1}, \mathbf{W}^{D_v\times K}, \mathbf{V}^{D_r\times K}, \sigma^2_h, \sigma^2_\alpha, \sigma^2_\beta, \sigma^2_\xi, \sigma^2_\gamma, \sigma^2_\lambda, \mathbf{A}^{K\times K}_u, \mathbf{A}^{K\times K}_v\} \tag{5.20}$$

The likelihood is computed as follows. The response $y_{h_{ij}}$ depends on the latent variables. Then, the complete likelihood is given by the product of the probability of the response, given the variables, and the probability of the latent variables. As such variables are assumed to be independent, their joint probability is just the product of the probability of each one. All of them are assumed to follow a Gaussian distribution and, after applying the log and performing several simplifications, Equation 5.18 is reached.
An interaction variable, γ_i^k or λ_i^k, only impacts the helpfulness if the corresponding indicator variable, δ_1(i,k) or δ_2(i,k), is 1. Thus, all pairs with indices i, k such that k ∉ S_i^0 and k ∉ C_i^0 are outside the model.

In the E-step, as the product u_i^T v_j corresponds to the inner product of two multivariate normal distributions and does not have a known density, the posterior distribution of the latent factors given observed ratings and parameters, P(Ω | H^0, Θ), is approximated by Gibbs Sampling, exactly as in RLFM. Using such a method requires the empiric mean and variance of the distribution of any variable given all the others, i.e., for a variable φ, E(φ | Ω − {φ}) and V(φ | Ω − {φ}). The mean is approximated by the point φ̂ where the derivative of the log-likelihood is zero, ℓ′(φ̂) = 0, and the variance by the inverse of the second derivative applied to φ̂, i.e., ℓ″(φ̂)^{-1} [Booth and Hobert, 1999]. The formulas for the mean and variance of each variable are given in Table 5.8. Variables with similar formulas are listed in column Similar, just requiring the replacement of the corresponding variables, parameters and features.

Table 5.8: Formulas used by the CAP algorithm.
E(α_i | Ω − {α_i}) = V(α_i | Ω − {α_i}) ( d^T x_{u_i} / σ_α^2 + Σ_{h ∈ H_{u_i}} (h_y − h_p + α_i) / σ_H^2 )  |  Similar: β_j, ξ_k
V(α_i | Ω − {α_i}) = ( 1/σ_α^2 + |H_{u_i}| / σ_H^2 )^{-1}
E(γ_i^k | Ω − {γ_i^k}) = V(γ_i^k | Ω − {γ_i^k}) ( g(r^T x_{s_ik}) / σ_γ^2 + Σ_{h ∈ H_{s_ik}} (h_y − h_p + γ_i^k) / σ_H^2 )  |  Similar: λ_i^k
V(γ_i^k | Ω − {γ_i^k}) = ( 1/σ_γ^2 + |H_{s_ik}| / σ_H^2 )^{-1}
E(u_i | Ω − {u_i}) = V(u_i | Ω − {u_i}) ( A_u^{-1} W^T x_{u_i} + Σ_{h ∈ H_{u_i}} (h_y − h_p + u_i^T v_j) v_j / σ_H^2 )  |  Similar: v_j
V(u_i | Ω − {u_i}) = ( A_u^{-1} + Σ_{h ∈ H_{u_i}} v_j v_j^T / σ_H^2 )^{-1}
∂E*/∂r = (1/σ_γ^2) Σ_{(u_i,a_k) ∈ S^0} ( E*(γ_i^k) − g(r^T x_{s_ik}) ) g′(r^T x_{s_ik}) x_{s_ik}  |  Similar: h
∂²E*/∂r∂r^T = (1/σ_γ^2) Σ_{(u_i,a_k) ∈ S^0} ( (E*(γ_i^k) − g(r^T x_{s_ik})) g″(r^T x_{s_ik}) − (g′(r^T x_{s_ik}))^2 ) x_{s_ik} x_{s_ik}^T

In the M-step, the expectation is optimized towards a maximum by updating the parameter values.
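The conditional moments of Table 5.8 can be sketched, for the case of α_i, as follows; `residuals` and `prior_mean` are illustrative names for (h_y − h_p + α_i) and d^T x_{u_i}:

```python
def alpha_conditional(prior_mean, sigma2_alpha, residuals, sigma2_H):
    """Gibbs-sampling moments for alpha_i given all other variables
    (first two rows of Table 5.8). `residuals` holds one value
    (h_y - h_p + alpha_i) per helpfulness vote of user u_i."""
    var = 1.0 / (1.0 / sigma2_alpha + len(residuals) / sigma2_H)
    mean = var * (prior_mean / sigma2_alpha + sum(residuals) / sigma2_H)
    return mean, var

# Prior mean d^T x_{u_i} = 0.5, two votes with residuals 1.0 and 2.0,
# unit variances: var = 1/3 and mean = 3.5/3.
m, v = alpha_conditional(0.5, 1.0, [1.0, 2.0], 1.0)
```

The formulas for β_j, ξ_k, γ_i^k and λ_i^k follow the same precision-weighted-average pattern after the substitutions indicated in the Similar column.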
As we approximate the expectation using Gibbs Sampling, it is calculated by averaging the log-likelihood value across samples. Setting the derivatives of the approximated expectation to zero is equivalent to solving an OLS regression. Let w be the vector of regression coefficients. Then, OLS is performed as:

w = (ηI + X^T X)^{-1} X^T y,    (5.21)

where X_{i,·} contains the features and y_i the response of the i-th example. The matrix with η values in the diagonal is added to ease the inversion [Tang et al., 2013]. In the case of a vector such as u_i, K regression problems are solved, one for each dimension. OLS is easily generalized to the vector case by turning w into a matrix W, such that W_{·,k} contains the regression coefficients of the k-th dimension, and y also becomes a matrix Y = [y_1 y_2 ... y_K] of K response arrays.

However, the parameters related to the interaction variables fall under a different procedure due to the sigmoid transformation of the linear regression. Such parameters are solved, instead, by the Newton-Raphson method, which approximates the root of a function [Ypma, 1995]. Since a root of the derivative is a local optimum, the derivative plays the role of the main function; we also need the derivative of this main function, which in this case is the second derivative of the expectation. Such formulas are presented in Table 5.8 and are similar for both interaction parameters after performing the proper replacements.

The MCEM method applied to CAP is presented in Algorithm 2. The Newton-Raphson function receives two parameters besides the function and its derivative: max_iter_N and α_N, which represent the maximum number of iterations and the learning rate for fitting the parameter, respectively. The algorithm complexity is O(num_iter · |Ω| · (num_samples · K + max_iter_N · D^3 + D · K + |H^0| · K)), where D is the total number of features. All terms have a linear contribution from the number of iterations and the number of variables.
The first term corresponds to the sampling cost, which is the highest for array variables; as covariances are ignored, such cost is linear in K. The second term refers to the Newton-Raphson cost: a Moore-Penrose pseudo-inverse is computed in each iteration of the method, whose cost is O(D^3); additionally, in each NR iteration, evaluating the function and its derivative costs O(|Ω| · D) and O(|Ω| · D^2), respectively, where the second is higher. The third term is related to OLS and is represented by the cost of multiplying the matrix of features by the response matrix, in the array case. Typically, the NR term will be higher than the OLS one, but this depends on K. The fourth term corresponds to the update of σ_h^2, which computes all predictions, each one with a cost of O(K) due to the dot product. Note that, differently from BETF (the other RRS approach), CAP's computational cost is linear in K and all the other variables, and only cubic in D.

In the original publication of CAP, only the Monte Carlo mean and variance of α_i were presented [Tang et al., 2013]. The remaining means and variances, as well as the derivatives of the expectation with respect to r and h, are all derived here. Additionally, we have observed that CAP is an extension of RLFM, a fact not previously stated.

Algorithm 2: Algorithm of CAP Review Recommender System
Data: integer: K, num_samples, num_iter, max_iter_N; real: η, α_N; sets: V^0, R^0, A^0, S^0, C^0, H^0, feature vectors {x_{u_1}, ..., x_{u_|V^0|}}, {x_{r_1}, ..., x_{r_|R^0|}}, {x_{a_1}, ..., x_{a_|A^0|}}, {x_{s_ik}}, {x_{c_ik}}
Result: utility function τ
 1  initialize Ω and Θ with small random values;
 2  for i = 1 ... num_iter do
      // E-step
 3    for s = 1 ... num_samples do
 4      foreach φ ∈ Ω do
 5        φ^(s) ← N(E(φ | Ω − {φ}), Var(φ | Ω − {φ}));†
 6      end
 7    end
 8    foreach φ ∈ Ω do
 9      φ* ← mean({φ^(1), ..., φ^(num_samples)});
10      φ*_var ← empiric_var({φ^(1), ..., φ^(num_samples)});∓
11    end
      // M-step
12    foreach θ ∈ Θ do
13      Φ ← variables_associated_to(θ);
14      Y ← row_bind([φ* for each φ ∈ Φ]);
15      X ← row_bind([features_of(φ) for each φ ∈ Φ]);
16      if θ ∈ {r, h} then
17        θ ← newton_raphson(iter = max_iter_N, learning_rate = α_N, function = ∂E/∂θ(Φ), derivative = ∂²E/∂θ²(Φ));
18      else
19        θ ← (ηI + X^T X)^{-1} X^T Y;
20      end
21      σ^2 ← variance_associated_to(θ);
22      Σ* ← [φ*_var for each φ ∈ Φ];
23      n ← number_of_rows(X);
24      if θ ∈ {W, V} then
25        RSS ← Σ_{i=1..n} Σ_{j=1..K} (θ^T X_{i,·} − Y_{ij})^2;
26        σ^2 ← (RSS + Σ_{i=1..n} Σ_{j=1..K} Σ*_{ij}) / (Kn);
27      else
28        RSS ← Σ_{i=1..n} (θ^T X_{i,·} − Y_i)^2;
29        σ^2 ← (RSS + Σ_{i=1..n} Σ*_i) / n;
30      end
31      RSS_h ← Σ_{h ∈ H^0} (p_h − y_h)^2;
32      σ_h^2 ← (RSS_h + Σ_{h ∈ H^0} empiric_var({p_h^(1), ..., p_h^(num_samples)})) / |H^0|;
33    end
34  end
35  return τ(u_i, r_j) = u_i^T v_j + α_i + β_j + ξ_k + δ_1(i,k) γ_i^k + δ_2(i,k) λ_i^k;‡

†If φ is a vector, then a multivariate normal distribution is used (MVN).
∓The empiric variance for the vector case is computed as the empiric variance of each dimension independently; covariances are ignored.
‡When a variable φ is undefined for the input entity, it is replaced by the regression on features, θ^T x, where θ is the weight parameter and x the associated feature vector. Also, a_k is such that ∃p : π_r(r_j) = (p, a_k).

5.7 Implementation Specifics

The implementation of most approaches considers available source code: all regressors – LR, SVR and GBRT – use Scikit-learn for Python [Pedregosa, 2011]; RSVM uses an implementation built upon SVMLight⁵ [Joachims, 2006]; LambdaMART employs RankLib⁶; and Regression-Based Latent Factor Models (RLFM) uses the authors' available source code⁷.

⁵Support Vector Machine for Ranking. http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
⁶The Lemur Project – RankLib. http://sourceforge.net/p/lemur/wiki/RankLib
The other algorithms – Matrix Factorization and the two methods in the RRS class (BETF and CAP) – were implemented from scratch and are publicly available⁸. Implementation details are given next.

Preprocessing. All feature values are scaled within [0, 1], using the maximum and minimum values in the training set. We also perform mean imputation for feature values that are undefined in the training set, such as the average rating of a user with no written reviews and the average helpfulness of a user who gave no vote.

Initial Values. Setting proper initial values is essential for good performance, both for latent variables and parameters, so we empirically determine the most suitable range of values. The best result for BETF occurs with initial values drawn at random from the range (0, 1), since it applies a logistic transformation. For CAP and MF, a smaller range is required for convergence, and we use a random sample from (10^-10, 10^-8).

Bias Baseline. We account for bias as a preprocessing step by subtracting the portion of the value explained by a bias baseline predictor; the bias prediction is then added back to the methods' predictions. We do not include this baseline for techniques that already acknowledge bias inside the model, specifically MF, RLFM, BETF and CAP. Biases are computed in a preprocessing step using the stochastic gradient descent method. Then, only the portion of the helpfulness votes not explained by the bias baseline prediction is fitted by each technique. Since our design has all reviews as cold-starts (because of the chronological split, as discussed in Section 4.2), we replace the review bias with the sum of author and product biases, inspired by BETF's decomposition of the review dimension into author and product [Moghaddam et al., 2012].

Cold-start Prediction. For collaborative filtering techniques without any rule for cold-start prediction, we use the overall mean when a cold-start is found.

Matrix Factorization (MF).
Since all reviews are cold-start, we model the authors of reviews as the items under evaluation. Previous studies validated the Author Consistency Hypothesis [Lu et al., 2010], and we provide further evidence in Section 4.5. Additionally, we implement MF with the possibility of jointly fitting biases, controlled by a user-defined parameter.

UnBiased Extended Tensor Factorization (BETF). As all reviews are cold-start in our design, we replace the review bias with the sum of the product and author biases, following a pattern similar to the replacement of the review dimension conducted in this method.

⁷Research Code for LFM. http://github.com/yahoo/Latent-Factor-Models
⁸Review Recommendation Repository. http://github.com/lucianamaroun/review_recommendation

Table 5.9: Summary of the techniques considered.
Method | Advantages | Disadvantages | Calculus of τ | Fitting
OM | Very simple and very fast | Not for ranking (constant), not personalized | Overall mean in the training set | Arithmetic mean of all votes
AM | Simple and fast, works for ranking | Not personalized | Author's mean in the training set | Arithmetic mean of each author's votes
VM | Simple and fast, personalized | Not for ranking (constant) | Voter's mean in the training set | Arithmetic mean of each voter's votes
LR | Simple and fast | Over-restrictive for ranking, only linear relations | Linear combination of features plus intercept | Cholesky, SVD, or iterative optimization¹
SVR | Robust to noise, non-linear model through kernel | More costly, suffers with high dimensionality | Linear combination of features plus intercept | QP² on dual, or iterative optimization
GBRT | Unrestricted non-linear model, good for ranking in general | Tendency to overfit due to unrestricted format | Weighted sum of decision trees | Stage-wise fitting of decision trees on pseudo-responses
RSVM | Ranking goal, non-linear model through kernel | Cost of processing all pairs | Linear combination of features | QP on dual, or iterative optimization
LMART | Ranking goal, unrestricted non-linear model | Tendency to overfit due to unrestricted format | Weighted sum of decision trees' outputs | Stage-wise fitting of decision trees on lambda variables
MF | Simple, no need of features | Not able to predict for complete cold-starts | Inner product of user and item latent factors plus bias variables and mean | SVD, or iterative optimization
RLFM | Predicts for cold and warm starts | Needs enough features, costly to sample | Sum of regression and latent variables | MCEM
BETF | No need of features, specialized for review recommendation | Not able to predict for cold-starts | Sum of mean, bias variables and tensor product of latent factors | Iterative optimization
CAP | Predicts for cold and warm starts, specialized for review recommendation | Needs enough features, costly to sample and compute NR | Sum of latent variables | MCEM
¹Any procedure, such as stochastic gradient descent. ²Quadratic Programming.

5.8 Concluding Remarks

In this chapter, we defined a set of techniques from five classes of increasing specialization included in our experimental comparison. We described the overall characteristics of each class and how the original input is converted in order to be applied under the specific paradigm. We also presented the behavior of each technique, especially in terms of how the utility function is computed and the optimization objective, when it applies.

Table 5.9 summarizes all methods considered in this work. We briefly discuss the advantages and disadvantages of each method, as well as the computation of the utility function and the fitting procedure. As one of our goals is to understand how specialized solutions compare with other methods, we presented the RRS approaches – BETF and CAP – in depth. Their original papers [Moghaddam et al., 2012; Tang et al., 2013] lack details required for implementation and analysis, such as precise algorithms and formulas, which we fully present in this work.
Despite being the state of the art for review recommendation, they have not yet been compared in all of the following scenarios: (i) against each other; (ii) with statistical significance against other techniques; (iii) under a ranking perspective; and (iv) with the whole set of techniques presented here, all of them in a personalized setting (except the mean-based predictors OM and AM). In the next chapter, we perform such comparisons, deriving the best overall approach and the key aspects of an effective review recommender system.

Chapter 6
Experimental Analysis

In this chapter, we compare the considered techniques under a unified statistical design of experiments. Such design consists of evaluating all techniques after conducting parameter tuning, with both ranking and regression metrics and statistical significance (Section 6.1). We start the analysis by investigating the impact of parametrization (Section 6.2). Then, we compare all the solutions (Section 6.3) and draw some insights regarding the review recommendation problem (Section 6.4). In the following, we delve deeper into the SVR predictor, the best performer, and explore extensions to improve this method for the defined task (Section 6.5). Finally, we present the concluding remarks of our experimental analysis (Section 6.6).

6.1 Experimental Design

Given the large number of considered techniques, we choose to first group them by their class (e.g., REG, GRS), then evaluate the different methods within each class, and finally compare the best performer of each class. The experimental design is performed with: (i) optimized parameters, by evaluating performance on a validation set; (ii) ranking and regression metrics; and (iii) statistical significance, by considering variability. With the purpose of assessing the effectiveness of existing methods under different scenarios, we execute them for different data splits. Five splits of the dataset are defined, with sliding windows over the votes ordered chronologically.
This allows us to account for contextual variability, which consists of different behaviors of a technique due to distinct training and test sets. We also consider the stochastic nature of the techniques by performing multiple runs for each split; then, we derive the average result for a split. The splitting scheme works as follows: a window size is fixed at 60% of the votes, shifting by 10% of the votes each time. Such size represents a compromise between the overlap of windows and the amount of data for learning and evaluation. For each window, the first 40% represent the training set; the next 10%, the validation set; and the last 50%, the test set.

Figure 6.1: Preprocessing step for modeling the data to be used by the review recommender techniques. Votes are divided into splits and, for each split, features are extracted from reviews, users and interactions according to the training set available.

Figure 6.1 shows the chronological split and the preprocessing step composing features from the training data for reviews, users and author-voter interactions. Only hybrid and content-based recommenders use such features, but all of them use the corresponding split's training set. We use a larger test than training set because the former is further reduced by eliminating votes from voter-product pairs comprising fewer than 5 votes. When dividing between training, validation and test sets, the rankings (which originally have at least 10 reviews) may have fewer than 5 in the test set. This filtering procedure is required for evaluating the methods using a ranking metric over the top 5 reviews.
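The sliding-window scheme above can be sketched as follows; this reads the train/validation/test fractions as fractions of each window, which is one possible interpretation, and all names are illustrative:

```python
def chronological_splits(n_votes, n_splits=5, window=0.6, shift=0.1,
                         train=0.4, val=0.1, test=0.5):
    """Index ranges over chronologically sorted votes: a window covering
    60% of the votes slides by 10% per split; inside each window the
    first 40% train, the next 10% validate, and the last 50% test."""
    w = int(n_votes * window)
    s = int(n_votes * shift)
    splits = []
    for i in range(n_splits):
        start = i * s
        tr = int(w * train)
        va = int(w * val)
        splits.append((range(start, start + tr),           # training
                       range(start + tr, start + tr + va),  # validation
                       range(start + tr + va, start + w)))  # test
    return splits

splits = chronological_splits(100)  # windows of 60 votes, shifted by 10
```

With 100 votes, the first window trains on indices 0-23, validates on 24-29 and tests on 30-59, and the fifth window ends exactly at the last vote.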
The final sizes of the test sets are around 38% for all splits, very close to the training size. We use a chronological instead of a random split of the data for building training and test sets, unlike many prior studies [Lu et al., 2010; Moghaddam et al., 2012; Zhang and Tran, 2011]. A chronological split¹ regards temporal constraints when building the predictive model. Specifically, by doing such a split, we avoid using future information to build the predictor (e.g., feature values computed over data observed after the prediction time), which is unachievable in a real scenario, as already discussed in Section 4.2.

¹A solution may consider the hypothesis that users would give the same helpfulness vote regardless of the current time. Even so, if features that change over time are used, such as the number of written reviews, then a chronological split is required.

For every algorithm's evaluation, we compute the confidence interval (CI) with 95% confidence (equivalently, 5% significance) relative to the results on the different training-test splits. We compare any two solutions using a paired t-test [Jain, 1991], with the same confidence. In this test, the difference of scores in each split is calculated. Then, a hypothesis test is performed such that the null hypothesis states that the mean of the differences is zero. If such hypothesis is rejected with the desired level of confidence, the methods are significantly different. For such tests, we report the lowest significance at which the null hypothesis would be rejected, the p-value. As we adopt 5% significance, p-values of 0.05 or lower yield significant results. To apply this test, we assume that the performance results of a technique are normally distributed.

When performing parameter tuning, we evaluate different values for all parameters of LR, SVR, MF, RLFM, BETF and CAP, considering their original format.
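The paired t-test described above can be sketched as follows; a library routine such as scipy.stats.ttest_rel performs both steps, but the statistic itself is simple (names here are illustrative):

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired t-test on per-split score differences;
    the p-value is then obtained from a t distribution with n - 1
    degrees of freedom."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    mean = sum(d) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in d) / (n - 1))
    return mean / (sd / math.sqrt(n))

# nDCG@5-style scores of two methods over five splits.
t = paired_t_statistic([0.92, 0.93, 0.94, 0.95, 0.96],
                       [0.91, 0.91, 0.91, 0.91, 0.91])
```

Pairing by split removes the between-split variability, which is why the paired test is much more powerful here than an unpaired one.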
For the other methods, we focus on those parameters that, reportedly, have a larger impact on the results, leaving the others at their default values². Note that methods in the MBP class have no parameters.

²We did not evaluate different kernels for RSVM due to its excessive computational cost.

The overall prediction process is represented in Figure 6.2. Initially, the specific input is generated, as algorithms may differ in what data they expect. In the case of a collaborative filtering approach, a user-item matrix is built without feature vectors, just using the identifiers of the entities (users and reviews, possibly authors and products as well) in the training set. For content-based approaches, the feature vectors of reviews, users and interactions are concatenated when belonging to the same vote. After processing the input, the utility function is learned and applied to unseen data for each repetition. Validation and test sets follow the same format in the prediction process. The only difference lies in the purpose: while validation is intended for optimizing parameters, the performance on the test set corresponds to the final performance measure of the algorithm and is not used for anything else.

Figure 6.2: Prediction step for any algorithm. The input is composed of a set of feature vectors or a user-item matrix.

In our evaluation, we consider three metrics: Root Mean Squared Error (RMSE), normalized Discounted Cumulative Gain at position p (nDCG@p) and Expected Reciprocal Rank (ERR). RMSE, a regression metric, is defined as follows:

RMSE(H′) = sqrt( (1/|H′|) Σ_{h ∈ H′} (y_h − p_h)^2 ),    (6.1)
where H′ ⊆ H is the set of votes whose predictions are under evaluation. Such metric is used in all previous evaluations of the specialized methods [Moghaddam et al., 2012; Tang et al., 2013] and is a standard metric for recommender systems [Koren, 2008]. We use it mainly to contrast results under previous assumptions of review recommendation as a rating prediction task. However, the purpose of most recommendation tasks is to compose a ranking of items for a user; so, a ranking metric is more adequate [Balakrishnan and Chopra, 2012] and we also use nDCG@p, defined as:

DCG@p(O_{u,p}) = Σ_{i=1}^{p} (2^{y_{u,r_i}} − 1) / log_2(i + 1),    nDCG@p(O_{u,p}) = DCG@p(O_{u,p}) / DCG@p(O*_{u,p}),    (6.2)

where O_{u,p} is a ranking of reviews for user u and product p, y_{u,r_i} is the true vote given by user u to the review ranked at position i in O_{u,p}, and O*_{u,p} is the ranking of the reviews about product p sorted by decreasing true votes by user u (i.e., by y_{u,r_i}), which is the best possible ranking. As nDCG@p is defined for a single ranking, our final score is the average nDCG@p over a set of rankings. This metric is beneficial due to its unrestricted range of relevance values and its exponential decay, which severely penalizes the score for lower positions [Anderson, 2014]. We apply such metric for p = 1 ... 5, with a focus on p = 5, to address the cost of reading reviews.

Another ranking measure for a scale of relevance is Expected Reciprocal Rank (ERR) [Chapelle et al., 2009]. It considers user navigational behavior through a cascading model.
For each element in a ranking, there is an associated probability P_k of the review at position k completely satisfying the user's needs:

P_k = (2^{y_k} − 1) / 2^{y_max},    (6.3)

where y_k is the helpfulness of the review at position k, y_max is the maximum possible relevance (5, in our case), and we use the gain function for estimating probabilities as performed in the original paper [Chapelle et al., 2009]. Thus, ERR is defined as:

ERR(O_{u,p}) = Σ_{i=1}^{|O_{u,p}|} (1/i) P_i Π_{j=1}^{i−1} (1 − P_j),    (6.4)

which is the sum of the reciprocal rank weighted by the probability of the user stopping at position i, i = 1 ... |O_{u,p}|, thus being an expectation. Using a cascade model, each probability of stopping at one element is independent, thus allowing the product of probabilities to obtain the joint distribution [Chapelle et al., 2009]. Thus, the event of a user stopping at position i requires that the user did not stop at positions 1 to i − 1. We incorporate ERR to double-check the performance under a ranking perspective, allowing us to see whether both metrics agree or not. As an advantage, ERR has a smooth stopping criterion, determined by a probabilistic user model. Similarly to nDCG@p, we evaluate several rankings by computing the mean.

We preferred not to use a binary relevance evaluation (for instance, Mean Average Precision at p – MAP@p), as such practice requires dividing votes into helpful or not helpful, thus implying information loss. Also, it is hard to explicitly state the cut point between the two classes: it may vary across users, or it may be interpreted wrongly due to an insufficient amount of training data. Finally, in settings where a very helpful review does not exist, it is still important to deliver the most relevant ones among the available set. For example, when no review with helpfulness 5 or 4 exists, distinguishing a 3 from the rest is still important.

Finally, the goal of a review recommender is to generate a total ordering of the reviews available for each user-product pair.
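The three metrics above (Eqs. 6.1 to 6.4) can be sketched compactly; `relevances` holds the true votes of the ranked reviews, in ranked order, and all names are illustrative:

```python
import math

def rmse(true_votes, predictions):
    """Root Mean Squared Error over a set of votes (Eq. 6.1)."""
    n = len(true_votes)
    return math.sqrt(sum((y - p) ** 2
                         for y, p in zip(true_votes, predictions)) / n)

def ndcg_at_p(relevances, p):
    """nDCG@p with exponential gain (Eq. 6.2)."""
    def dcg(rels):
        # position i (1-based) contributes (2^y - 1) / log2(i + 1)
        return sum((2 ** y - 1) / math.log2(i + 2)
                   for i, y in enumerate(rels[:p]))
    return dcg(relevances) / dcg(sorted(relevances, reverse=True))

def err(relevances, y_max=5):
    """Expected Reciprocal Rank (Eqs. 6.3 and 6.4) under the cascade model."""
    not_stopped, score = 1.0, 0.0
    for i, y in enumerate(relevances, start=1):
        p_i = (2 ** y - 1) / 2 ** y_max   # Eq. 6.3: stop probability
        score += not_stopped * p_i / i    # user stops exactly at position i
        not_stopped *= 1 - p_i
    return score

reg_err = rmse([5, 3, 4], [4, 3, 2])  # residuals 1, 0, 2
rank_1 = ndcg_at_p([5, 3, 4], p=3)    # the ideal ordering would be [5, 4, 3]
rank_2 = err([4, 2, 5])
```

Note the contrast: swapping two reviews never changes RMSE, while both ranking metrics penalize the misplaced high-helpfulness review.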
However, the dataset does not contain all reviews of a product with helpfulness votes from all users. Under this restriction, we only evaluate the ordering of the reviews with available votes for each user-product pair. This is a subset of the intended ranking, but whose relative positioning shall be equally good.

6.2 Parameter Tuning

The best parameter value is the one that optimizes either nDCG@5 or RMSE, according to the final goal. As we evaluate every method with both metrics to contrast them, we performed both tunings. Only LTR approaches are left out of the RMSE evaluation, as they are not appropriate for regression: they predict relevance scores in any range, not necessarily that of helpfulness. Table 6.1 lists all parameters evaluated, with a description, the range or set of values evaluated, and the best value for nDCG@5 and RMSE. Each range has at least five values explored, such that wide ranges (spanning more than two orders of magnitude) are explored exponentially (multiplying the previous value by a factor) and the remainder with equally spaced values.

Table 6.1: List of parameters with a description, evaluated range or set, and the best value for ranking (nDCG@5*) and regression (RMSE*).
Method | Symbol | Description | Range/Set | nDCG@5* | RMSE*
LR | β | Regularization factor | (10^-6, 10^6) | 10^3 | 10^3
LR | bias | Addition of bias | {true, false} | false | false
SVR | C | Penalty factor | (10^-4, 10) | 10^-3 | 1
SVR | kernel | Kernel function | {linear, poly2, poly3, rbf, sig} | linear | linear
SVR | ε | Epsilon-insensitive margin | (10^-4, 10) | 10^-1 | 10^-1
SVR | bias | Addition of bias | {true, false} | false | false
GBRT | α | Learning rate | (10^-4, 1) | 10^-1 | 10^-1
GBRT | T | # Trees | (10, 10^3) | 10^2 | 32
GBRT | D | Depth of trees | (1, 5) | 4 | 2
GBRT | loss | Loss function | {ls, lad, huber, 0.5/0.9-quant} | ls | ls
GBRT | S | Subsample by tree | (0.2, 1) | 1 | 0.6
GBRT | F | Max. features by tree | (10, 43) | 43 | 10
GBRT | bias | Addition of bias | {true, false} | false | false
RSVM | C | Penalty factor | (10^-3, 10) | 10^-2 | −
RSVM | bias | Addition of bias | {true, false} | true | −
LMART | α | Learning rate | (10^-5, 10^-1) | 10^-5 | −
LMART | T | # Trees | (10, 10^3) | 10 | −
LMART | L | # Tree leaves | (5, 25) | 5 | −
LMART | bias | Addition of bias | {true, false} | true | −
MF | K | # Latent dimensions | (5, 50) | 25 | 50
MF | I | # Iterations | (10, 50) | 10 | 50
MF | tol | Convergence tolerance | (10^-6, 10^-2) | 10^-4 | 10^-4
MF | α | Learning rate | (10^-5, 1) | 10^-1 | 10^-1
MF | β | Regularization factor | (10^-4, 1) | 10^-4 | 10^-1
MF | bias | Addition of bias | {true, false} | true | true
RLFM | K | # Latent dimensions | (5, 50) | 15 | 25
RLFM | I | # Iterations | (5, 25) | 20 | 25
RLFM | G | # Gibbs samples | (10, 10^3) | 30 | 30
RLFM | B | # Burn-in samples | (0, 20) | 0 | 0
BETF | K | # Latent dimensions | (3, 20) | 4 | 4
BETF | I | # Iterations | (5, 25) | 15 | 20
BETF | tol | Convergence tolerance | (10^-6, 10^-2) | 10^-4 | 10^-6
BETF | α | Initial learning rate | (10^-4, 10) | 10^-3 | 1
BETF | β | Regularization factor | (10^-4, 10) | 1 | 10^-3
CAP | K | # Latent dimensions | (3, 25) | 5 | 8
CAP | I | # Iterations | (5, 25) | 5 | 10
CAP | G | # Gibbs samples | (10, 10^2) | 50 | 20
CAP | I_N | # NR iterations | (10, 50) | 10 | 10
CAP | α | NR learning rate | (10^-5, 10^-1) | 10^-2 | 10^-5
CAP | η | OLS constant | (10^-6, 10^-2) | 10^-6 | 10^-5

Observing the parameters of the REG class regarding nDCG@5, there is a tendency of best performance under simpler models: low C = 10^-3 and kernel = linear for SVR, and high regularization β = 10^3 for LR. Additionally, in every method from this class, accounting for bias leads to worse performance. For GBRT, we note that the best loss function is least squares (ls), which penalizes high deviations more. The optimal value for bias in the LTR class is true, in opposition to REG. The better average performance for smaller values of T and L for LMART indicates that simpler models, with fewer rules, are more effective. For MF, jointly updating biases as additional variables is a good strategy. For both GRS methods, more iterations do not represent an improvement in accuracy. The same occurs for the number of Gibbs samples. Using no burn-in samples (those disregarded at the beginning of sampling) was better than using any. In general, such values provide a more stochastic model that is less prone to overfitting.
Finally, the best parameters of the techniques in the RRS class also indicate that the simpler the model the better: we observe low to moderate values for K, I, G, and I_N. Moreover, a relatively high regularization β = 1 is the best for BETF. The distinction among the optimal values for nDCG@5 and RMSE is evidence of their different purposes. In general, when using the RMSE criterion, more complex models are built, given by higher K, more iterations I, higher C, and lower regularization β.

An additional detail regarding ranking and regression metrics is the optimal number of iterations. In iterative optimization – which is the case of MF, RLFM, BETF and CAP – a larger number of iterations and a lower convergence tolerance do not always imply a better performance for nDCG@5. The reason for such unexpected behavior lies in the different fitting and evaluation criteria. Here, the fitting goal of the enumerated techniques is minimizing a regression error, while performance is evaluated by considering a ranking metric. More iterations and a lower convergence tolerance typically reduce the former, but not necessarily the latter, as they are uncorrelated metrics. Even for similar fitting and evaluation metrics, a lower error in training does not always imply the same in test, as the optimal RMSE parameters show, although they tend to be higher.

To our knowledge, such a study of parametrization for review recommendation is new. Previous works including any of these methods for the same problem did not

Table 6.2: Evaluation of techniques for the three metrics considered. The parameters used for RMSE are different from those of ERR and nDCG@5. For ERR and nDCG@5, the higher the better; for RMSE, the lower the better.
Technique | nDCG@5 | ERR | RMSE
OM | 0.8808 ± 0.0265 | 0.6616 ± 0.0032 | 0.4148 ± 0.0489
AM | 0.8944 ± 0.0278 | 0.6762 ± 0.0038 | 0.4394 ± 0.0624
VM | 0.8808 ± 0.0265 | 0.6616 ± 0.0032 | 0.4323 ± 0.0681
LR | 0.9269 ± 0.0178 | 0.6885 ± 0.0077 | 0.3916 ± 0.0463*
SVR | 0.9362 ± 0.0132* | 0.6983 ± 0.0121* | 0.3997 ± 0.0524*
GBRT | 0.9233 ± 0.0185 | 0.6857 ± 0.0060 | 0.3621 ± 0.0354*
RSVM | 0.8790 ± 0.0279 | 0.6608 ± 0.0024 | −
LMART | 0.8929 ± 0.0277 | 0.6726 ± 0.0036 | −
MF | 0.8886 ± 0.0286 | 0.6703 ± 0.0023 | 0.4370 ± 0.0621
RLFM | 0.9229 ± 0.0197 | 0.6843 ± 0.0053 | 0.4167 ± 0.0235
BETF | 0.8869 ± 0.0281 | 0.6674 ± 0.0023 | 0.4161 ± 0.0509
CAP | 0.9213 ± 0.0196 | 0.6846 ± 0.0060 | 0.4354 ± 0.0462

address such a parametric analysis [Moghaddam et al., 2012; Tang et al., 2013]. In fact, Tang et al. [2013] only state that parameters were defined using cross-validation, and Moghaddam et al. [2012] perform a sensitivity analysis of the number of latent dimensions K³. Furthermore, both works consider just the RMSE metric.

³Using what seems to be the performance on the test set.

6.3 Comparison of Techniques

We compare all techniques within their classes for nDCG@p, and then the best one of each class, to identify the method with the highest performance. As RMSE is widely used in the literature, we also quantify this metric using parameter settings optimized for it, except for the methods in LTR. Table 6.2 presents the average results with the corresponding 95% confidence intervals. The best of each class for a given metric is presented in bold, along with statistical ties. The overall best for a metric, together with statistical ties, has an asterisk suffix. Almost all CIs overlap; however, we compare any two methods using a paired t-test, which is much more powerful than an unpaired test and yields many significant conclusions.

Table 6.3 shows the p-values of the paired t-tests within classes for the performance of one method being significantly different from another (thus, the null hypothesis states the equality of the means of the two methods). The better one is given by the higher
average performance in Table 6.2. We note that, in such tests, the same conclusions are obtained for nDCG@5 and ERR, while RMSE differs.

Table 6.3: Results of paired t-tests, represented by p-values, for each class regarding all metrics. Each null hypothesis is a pair of methods whose means are tested for equality.

                                p-value
Hypothesis      nDCG@5      ERR         RMSE
OM = AM         < 0.0001    < 0.0001    0.0107
OM = VM         < 0.0001    < 0.0001    0.1188
AM = VM         < 0.0001    < 0.0001    0.2659
LR = SVR        0.0067      0.0054      0.0883
LR = GBRT       0.0104      0.0243      0.1847
SVR = GBRT      0.0040      0.0050      0.1064
RSVM = LMART    < 0.0001    < 0.0001    −
RLFM = MF       0.0005      0.0005      0.3095
CAP = BETF      0.0004      0.0003      0.0407

OM and VM do not generate meaningful rankings of reviews. Their predictions are constant for a reader-product pair and, thus, for all reviews in the same ranking. This happens because OM predicts a constant value for all reviews and VM predicts a constant value for the same voter, both unique values for a given ranking. Hence, they do not change the ordering of reviews, composing a worst-case solution. On the other hand, AM provides different predictions within a ranking and performs significantly better. While OM and VM are not suited for ranking, they perform well for regression. OM, the simplest one, gives the best result for RMSE, statistically tied with VM and significantly outperforming AM with 95% confidence.

All three methods in the REG class produce statistically different averages of nDCG@5. Regarding RMSE, though, they are not statistically different. Also, for nDCG@5, SVR is the best, followed by LR and GBRT. For RMSE, when comparing average results, such an order is inverted. For the remaining classes, LambdaMART has significantly the best performance for both ranking metrics within the LTR class. In GRS, RLFM is significantly better; yet, when using RMSE, both techniques are statistically tied.
Finally, in the RRS class, CAP significantly outperforms BETF for nDCG@5 and ERR, whereas BETF is better considering RMSE. This illustrates the difference between ranking and regression goals. Also, collaborative filtering may work for the latter, but not quite for the former.

Table 6.4: p-values obtained from paired t-tests of the best of each class for nDCG@5, comparing the technique in the column with the one in the row.

         LR       RLFM     CAP      LMART    AM
SVR      0.0067   0.0053   0.0037   0.0012   0.0014
LR                0.0046   0.0013   0.0008   0.0010
RLFM                       0.0219   0.0006   0.0008
CAP                                 0.0007   0.0010
LMART                                        0.0010

Table 6.5: p-values obtained from paired t-tests of the best of each class for ERR, comparing the technique in the column with the one in the row.

         LR       RLFM     CAP      LMART     AM
SVR      0.0054   0.0051   0.0038   0.0013    0.0022
LR                0.0102   0.0040   0.00063   0.0022
RLFM                       0.4824   0.0002    0.0017
CAP                                 0.0003    0.0017
LMART                                         0.0033

Table 6.6: p-values obtained from paired t-tests of the best of each class for RMSE, comparing the technique in the column with the one in the row.

         OM       RLFM     BETF
GBRT     0.0210   0.0125   0.0227
OM                0.8994   0.1940
RLFM                       0.9696

The p-values obtained from paired t-tests of the best methods using nDCG@5, ERR and RMSE are shown in Tables 6.4, 6.5 and 6.6, respectively. SVR is significantly the overall best for both ranking metrics, followed by LR, and both are statistically above all the others. RLFM significantly outperforms CAP using nDCG@5, but they are equivalent when considering ERR. LMART is worse than AM for both ranking metrics. The comparison is much harder regarding RMSE: GBRT outperforms all the other classes' representatives, which are statistically tied among themselves. The good performance of OM shows how vulnerable such a metric is to the high performance of naive predictors in the case of highly unbalanced distributions of responses.
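As a concrete sketch of this testing protocol, the paired t statistic can be computed directly from per-split scores. The nDCG@5 values below are hypothetical, and the critical value 2.776 is the two-sided 95% threshold of Student's t with 4 degrees of freedom (five paired splits):

```python
import math

def paired_t_statistic(a, b):
    # t statistic of a paired t-test (null hypothesis: equal means).
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-split nDCG@5 values for two techniques evaluated on the
# same five chronological splits (paired observations).
ndcg_rlfm = [0.941, 0.918, 0.902, 0.925, 0.929]
ndcg_mf = [0.905, 0.887, 0.871, 0.894, 0.886]

t = paired_t_statistic(ndcg_rlfm, ndcg_mf)
# Two-sided critical value of Student's t at 95% confidence, n - 1 = 4
# degrees of freedom.
T_CRIT_4_DF = 2.776
print("significant" if abs(t) > T_CRIT_4_DF else "tie")  # → significant
```

Pairing by split is what makes the test powerful even when the confidence intervals of the two techniques overlap: the per-split differences are far less variable than the scores themselves.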
Figure 6.3: Comparison of techniques within classes in panels (a) MBP, (b) REG, (c) LTR, (d) GRS and (e) RRS, and of the best of each class in panel (f), using nDCG@p for p = 1, ..., 5. The set of best algorithms includes LR, as it stands between SVR and the other classes in performance.

Figure 6.4: Comparison of all techniques but LTR regarding RMSE.

Figure 6.5: Comparison of techniques with nDCG@p using random splits. The best of each class is in panel (a); the best performer, RLFM, and the collaborative filtering (CF) methods are in panel (b). RLFM is significantly the best.

Figures 6.3a to 6.3e show the average performance within classes using nDCG@p. The relative order of methods is consistent across values of p. Figure 6.3f contrasts the nDCG@p of the best technique of each class, together with LR. This method is the second best in REG, yet above all the other classes' representatives, while comprising a very basic approach. Again, the performance is consistent with p; only CAP and RLFM swap places, by a small margin, for nDCG@1. Figure 6.4 presents the performance regarding RMSE visually.
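The nDCG@p values plotted above can be computed as in the following minimal sketch, which assumes the exponential-gain formulation (gain 2^rel − 1, discount log2 of the position); this formulation is consistent with the example values reported later in Section 6.4.6:

```python
import math

def dcg_at_p(relevances, p):
    # Exponential gain 2^rel - 1 with logarithmic position discount.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:p]))

def ndcg_at_p(truth, predictions, p):
    # Rank reviews by predicted helpfulness, then normalize the DCG of the
    # induced ordering by the ideal DCG of the true helpfulness values.
    order = sorted(range(len(truth)), key=lambda i: -predictions[i])
    ideal = dcg_at_p(sorted(truth, reverse=True), p)
    return dcg_at_p([truth[i] for i in order], p) / ideal if ideal > 0 else 0.0

# A prediction that sorts reviews exactly by true helpfulness is perfect:
print(ndcg_at_p([5, 4, 4], [0.9, 0.5, 0.1], p=3))  # → 1.0
```

Note that the ideal DCG is computed over the whole ranking's true helpfulness values, so nDCG@p penalizes a method that fails to surface any of the globally best reviews into its top p.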
We also compare the techniques by considering random splits of the training and test sets, shown in Figure 6.5. Even though such a split is often employed by previous works, we argue that it is not a valid approach, as discussed in Section 4.2. Figure 6.5a shows that this setting may lead to different conclusions: under a favorable context, with low rates of cold-start reviews, RLFM, not SVR, is the best approach.

As BETF and MF, both collaborative filtering approaches, are penalized by the high rate of cold-starts, we also investigate their performance under such random splits. Although they improve in performance, they are still not the best approaches. Moreover, the performance of all methods is somewhat inflated by random splits when compared to the results obtained with a chronological split, though not proportionally. Such results serve to show the importance of considering time constraints for a solid evaluation of recommendation methods.

Table 6.7: Ranking of the best of each class according to each metric.

nDCG@5   SVR ≻ LR ≻ RLFM ≻ CAP ≻ AM ≻ LMART
ERR      SVR ≻ LR ≻ RLFM = CAP ≻ AM ≻ LMART
RMSE     GBRT ≻ OM = RLFM = BETF

Finally, Table 6.7 shows the ranking of class representatives for each metric. We note that nDCG@5 and RMSE are uncorrelated metrics, as the relative performance of the methods changes drastically between them. For example, unlike what is observed for nDCG@5, when RMSE is used as evaluation metric, we find that GBRT is the best method, while even naive MBP techniques have competitive performance. ERR and nDCG@5 tend to follow the same pattern, with one small variation: whether RLFM and CAP are statistically different or not. Overall, nDCG@5 and ERR are very consistent, demonstrating that the ranking evaluation performed is cohesive.
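The chronological protocol argued for above can be sketched in a few lines; the vote tuples below are hypothetical:

```python
# Hypothetical helpfulness votes: (timestamp, voter, review, helpfulness).
votes = [
    (20081210, "u1", "r1", 4), (20090105, "u2", "r1", 5),
    (20090311, "u1", "r2", 3), (20090902, "u3", "r2", 4),
    (20100120, "u2", "r3", 5), (20100415, "u3", "r3", 4),
]

def chronological_split(votes, train_fraction=0.5):
    # Train on the oldest votes, test on the newest: the recommender never
    # sees evaluations cast after the ones it must predict.
    ordered = sorted(votes)  # tuples sort by timestamp first
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

train, test = chronological_split(votes)
# Every training vote precedes every test vote in time.
assert max(v[0] for v in train) <= min(v[0] for v in test)
```

A random split, in contrast, lets future votes leak into training, which is what inflates the results in Figure 6.5.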
6.4 Discussion of Results

We divide our discussion of the presented results into four main competitive properties of the techniques: ranking or regression objective, usage of observed or latent features, simple or sophisticated nature, and absent or present bias. Then, we contrast our conclusions with previous results addressed in specialized baseline experiments.

6.4.1 Ranking versus Regression Goal

Although review recommendation is intrinsically a ranking problem, our results reveal that LTR methods are not the best performers for such a task; their results are at best comparable to those produced by AM. They hardly learn good review orderings under sparse scenarios, i.e., when there are few evaluations of reviews for reader-product pairs. Also, only a small portion of the features is related to both user and review, which corresponds to the query-document interaction, while such features are the most important for learning to rank [Chapelle and Chang, 2011]. In general, having few features severely penalizes such a learning process; for instance, in the Yahoo Learning to Rank Challenge, more than 400 features are used.

Yet, probably the main problem with learning-to-rank solutions is the enormous amount of ties among pairs of reviews. On average, only 11.5% of all pairs in a given split have different relevance values for the filtered dataset used. Thus, from all data in the training set, pairwise techniques such as RankSVM only use around one tenth, as they disregard tied pairs. Despite being listwise, LambdaMART computes the change in a performance metric obtained by swapping pairs, but such a swap typically does not change the metric. Learning from a very small subset of the training set makes it difficult to derive a good predictor.

Although using a ranking goal for learning is not satisfactory, ranking is still important for evaluation, as it is more closely related to the way results are presented to the user.
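The tie statistics above can be estimated with a short sketch; `untied_pair_fraction` is an illustrative helper, not part of the evaluated techniques:

```python
from itertools import combinations

def untied_pair_fraction(rankings):
    # Fraction of review pairs, within the same reader-product ranking,
    # whose relevance (helpfulness) values differ; pairwise learners such
    # as RankSVM can only learn from these untied pairs.
    untied = total = 0
    for relevances in rankings:
        for a, b in combinations(relevances, 2):
            total += 1
            untied += a != b
    return untied / total if total else 0.0

# One ranking with a single 5 among 4s, plus one fully tied ranking:
print(untied_pair_fraction([[4, 4, 4, 5], [3, 3, 3, 3]]))  # → 0.25
```

With the 11.5% figure reported above, roughly nine out of ten training pairs carry no ordering signal at all, which is the sparseness that penalizes RankSVM and LambdaMART.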
Indeed, ranking and regression are clearly uncorrelated problems for review recommendation, which is notable from the very different results and conclusions given by nDCG@p and RMSE. Even naive predictors based on the mean perform well for RMSE, making it difficult to distinguish the methods with statistical significance. Since the ultimate goal is to provide an ordering of reviews to a reader, finding good predictors using ranking performance is a priority.

6.4.2 Observed versus Latent Features

Clearly, collaborative filtering techniques, very popular in recommender systems, do not perform as well for review recommendation. Sparseness is increased when reviews are the recommendable items, since their volume is typically greater and their rate of increase is even higher. The cost of writing a review, indeed, is much lower than that of producing a new product. When there are many cold-starts, using observed features is crucial. In our design, all reviews are cold-start. Nonetheless, Tang et al. [2013] measure the rate of cold-starts in the original dataset, obtaining 44.72% under chronological ordering. Still, almost half of the votes to predict cannot rely on latent factors, making it mandatory to use features in order to overcome such sparseness. Additionally, Tang et al. [2013] describe that the CAP method restricted to its latent vector inner product term is better than MF in the warm-start case due to observed features, although they use RMSE.

Unlike MF, RLFM incorporates observed features into the model, thus allowing prediction for cold-starts. Such behavior also results in better performance regarding nDCG@p. Therefore, considering that the review recommendation scenario is highly sparse, observed features are very important to complement historical behavior. This is also noted in the performance of BETF, a purely collaborative filtering approach that suffers severely from cold-starts. This method is outperformed by CAP using nDCG@p and ERR, although better in terms of RMSE.
This suggests that observed features are especially important for ranking-oriented recommendation.

6.4.3 Simple versus Sophisticated Predictive Model

The importance of the author's reputation is revealed by AM's ranking performance, which surpasses LMART's. A very simple model may capture a great part of the observed values. Therefore, replacing the review, as recommended item, by its author seems to be a reasonable approach. Indeed, MF approximates the RMSE of RLFM by applying such a concept, not being significantly different. (Footnote to the cold-start rate above: Tang et al. [2013] use the same data and report a chronological split based on vote timestamps; however, such information is not available in the publicly released dataset.)

Based on the experimental results, we observe that simpler models are usually more effective than sophisticated ones, as suggested by high regularization factors, low penalty factors, few iterations for optimization, and so on. Also, intrinsically simple models, such as LR, presented reasonably good performance. Overall, a better generalization pattern is encountered in such settings, and predictors then suffer less from overfitting. In fact, a linear model of the features is as good as a kernel transformation for SVR.

Complex models from the more generic classes, such as GBRT, RSVM and LMART, lag behind simpler models in performance. RLFM and CAP are the more specialized methods with better performance; they beat their counterparts especially by using observed features. However, such capability is not enough to surpass SVR using nDCG@5, nor GBRT in terms of RMSE. Thereby, the specialized methods, BETF and CAP, are incapable of overcoming simpler methods in both metrics.

6.4.4 Absent versus Present Bias

Our results indicate that adding a bias factor, which typically improves recommendation results, is not a good practice for Regressors (REG), unlike for Learning-to-Rank (LTR) and General-Purpose Recommender Systems (GRS).
This is likely due to the presence of many cold-starts, which causes the bias distributions in training and test to be discrepant. As a result, test performance degrades, especially for absolute predictions.

On the other hand, for relative prediction in LTR techniques, bias improves the results. In this case, the model learned by removing the bias effect and focusing on interaction overcame the discrepancies of biases between the training and test sets. Additionally, the voter's average helpfulness is a feature that may replace bias in REG, but not in learning to rank, as such a feature is constant within a query. Finally, bias may improve performance for a predictor relying on few training samples (as LTR does, due to the many tied pairs) by guessing right in some cases. But when added to an already robust model, performance drops.

6.4.5 Contrasting with Previous Experiments

BETF was also compared with LR, SVR and MF in its original proposal, whose conclusion pointed towards the best performance of the former in terms of RMSE [Moghaddam et al., 2012]; yet, we conclude the opposite, except for MF. Our experiments differ for four reasons: (i) we consider LR and SVR not only with non-personalized features, but also with personalized ones; (ii) we evaluate using a chronological split, while the authors used a random one, on a different dataset; (iii) we use around the same amount of data for training and test, while the authors used 90% for training and the remainder for test; (iv) we compare solutions through a statistical test. The larger training set especially benefits BETF, as fewer cold-starts occur in the test set. We conclude from the results that, regarding RMSE, BETF is indeed better than MF (p-value equal to 0.0117); thus, tensor modeling and simultaneously fitting ratings indeed improve over plain MF. However, SVR and LR are statistically better regarding RMSE (p-values equal to 0.0002 and 0.0010, respectively).
Regarding ranking metrics, MF, LR and SVR have statistically higher performance than BETF under paired t-tests (p-values of 0.0029, 0.0008 and 0.0005, respectively). MF is better for ranking probably due to its simplicity, in line with the several indications of better ranking performance from simpler models. In our experiments, BETF suffers heavily from cold-start, which is the probable reason for it being surpassed by SVR and LR in all evaluations.

Even though CAP is compared with LR in its original paper [Zhang and Tran, 2011], only non-personalized features were used for such a baseline. In their report, CAP outperformed LR in terms of RMSE. However, they did not address statistical significance, besides neglecting the potential of using personalized features for LR. In our experiments, we found that LR is better both for ranking and regression, significantly so under paired t-tests (p-values equal to 0.0013 and 0.0004, respectively).

CAP was previously claimed to be better than MF [Tang et al., 2013] in a work using the same dataset from Ciao UK, but without our filtering (which ignores invalid instances and requires rankings with at least 10 reviews). In our experiments, CAP is statistically tied with MF using RMSE (the p-value is 0.8518). We use MF with bias and parameter tuning, procedures that are not clear in the original paper [Tang et al., 2013]. Unlike them, we apply MF with the review replaced by the author dimension (to overcome cold-starts), and we evaluate the performance over five scenarios with statistical tests. In turn, CAP is significantly better for ranking (the p-value is 0.0006).

We also observe that CAP produces results that are statistically worse than RLFM's using nDCG@5 and statistically tied regarding ERR. Thus, the new model structure proposed in CAP is not significantly better. In fact, adding interaction features to plain regression is better than incorporating them through a logistic transformation and under certain constraints.
In short, incorporating observed features proves more relevant than changing the model structure. Such a comparison with RLFM was not presented in the work proposing CAP [Tang et al., 2013], although the latter is an extension of the former. (Footnote to Section 6.4.5: Moghaddam et al. [2012] use a dataset from Epinions, whose download link is currently broken.)

Table 6.8: Predictions of review helpfulness for the considered example by the three best techniques, along with the ground-truth helpfulness.

Review ID   Truth   SVR      LR       RLFM
177269      5       3.9217   3.9364   4.1653
7252        5       3.9196   3.9878   3.8722
82581       5       3.9190   3.9900   4.1733
11495       4       3.9187   3.9668   4.2678
70030       4       3.9184   4.0222   4.5724
71158       4       3.9183   3.9746   4.2137
143356      4       3.9180   4.0148   4.3899
130504      4       3.9175   3.9593   4.1021
171367      4       3.9170   3.9139   3.8908
74342       4       3.9167   3.9079   3.9972
5915        4       3.9161   3.8828   3.7378
26782       4       3.9149   3.8312   3.9288
18028       4       3.9148   3.8353   3.8629

Table 6.9: Rankings predicted by the three best techniques, for the top 5 positions. Each ranking is obtained by sorting on the predicted values; the true value at each position is shown.

Position   SVR   LR   RLFM
1          5     4    4
2          5     4    4
3          5     5    4
4          4     5    4
5          4     4    5

6.4.6 Illustrative Example

We now present an example of how SVR performs well for ranking, contrasting it with LR and RLFM, the second and third best techniques, respectively. We consider the product Daim Bar (a caramel bar covered with chocolate) and the reader with id 6044343. This user is a cold-start, with only one known vote, equal to 5.0, but is engaged in the social network, with 92 trustors and 321 trustees. Table 6.8 lists all reviews evaluated by this user, along with the true helpfulness given by him and the predictions of SVR, LR and RLFM, the best techniques for ranking. In this example, we note that SVR predicts within a smaller range than the others, limiting regression performance but capturing relative order better.
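The columns of Table 6.9 follow from Table 6.8 by sorting the reviews on each technique's predictions; a minimal sketch for the LR column:

```python
# Ground truth and LR predictions from Table 6.8, keyed by review ID.
truth = {177269: 5, 7252: 5, 82581: 5, 11495: 4, 70030: 4, 71158: 4,
         143356: 4, 130504: 4, 171367: 4, 74342: 4, 5915: 4, 26782: 4,
         18028: 4}
lr_pred = {177269: 3.9364, 7252: 3.9878, 82581: 3.9900, 11495: 3.9668,
           70030: 4.0222, 71158: 3.9746, 143356: 4.0148, 130504: 3.9593,
           171367: 3.9139, 74342: 3.9079, 5915: 3.8828, 26782: 3.8312,
           18028: 3.8353}

def true_values_at_top(truth, pred, k=5):
    # Rank reviews by predicted helpfulness and report the ground-truth
    # helpfulness found at each of the top-k positions.
    ranking = sorted(truth, key=lambda rid: -pred[rid])
    return [truth[rid] for rid in ranking[:k]]

print(true_values_at_top(truth, lr_pred))  # → [4, 4, 5, 5, 4], as in Table 6.9
```

Only the induced ordering matters here: two helpfulness-5 reviews slip to positions 3 and 4 under LR, which is exactly what its lower nDCG@5 reflects.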
Indeed, the ranking obtained from SVR's predictions is perfect, with nDCG@5 equal to 1.0, followed by LR with 0.754806 and RLFM with 0.754806. Contrasting with RMSE performance, SVR yields 0.5238441; LR, 0.5011651; and RLFM, 0.508706. Thus, LR is the best for regression, followed by RLFM and then SVR. Table 6.9 shows the true values at the top 5 positions obtained by ordering according to each technique's predictions.

Figure 6.6: Review excerpts with different helpfulness for the reader and product: (a) review with helpfulness equal to 5 (ID: 177269); (b) review with helpfulness equal to 4 (ID: 11495).

Figure 6.6 contrasts a review with helpfulness equal to 5 with one equal to 4. We note that the former is more structured, divided into sections. It is also considerably longer, with 2,267 tokens against 846 for the review with helpfulness 4. We also note a quite ironic style in the review with helpfulness 5, which the reader probably enjoys. On the other hand, the review with helpfulness 4 contains much personal information regarding the difficulty of finding the product, and generally irrelevant information, such as the size of the bar, which may be the cause of the reader's evaluation. This review receives the lower grade even though its author has a better reputation, with more written reviews and more trustors.

6.5 Inspecting the SVR Model

SVR is the best solution in our comparison of existing techniques, despite being a regression model. By focusing only on the larger errors, it is more robust than LR in finding a representative pattern and less susceptible to overfitting and to noise in the training data. A perfect LR model (i.e., with no error) on the training set implies a perfect ranking, but such a goal is hard to accomplish. A perfect SVR model, on the other hand, does not distinguish errors up to ε, thus not necessarily inducing a perfect ranking.
For example, considering ε = 0.1, if the true values 3.50 and 3.60 are predicted as 3.55 and 3.52, respectively, no error occurs in SVR; however, their order is predicted wrongly. Nonetheless, when the regression responses are integer values, a perfect SVR model yields a perfect ranking for every ε < 0.5. Indeed, for such ε, each possible integer response has a margin of predicted values that does not overlap with any other. For example, supposing ε = 0.4 and a perfect SVR model, a true value of 5 is predicted in the range [4.6, 5.4], a value of 4 in the range [3.6, 4.4], and so on. Note that, for any ε < 0.5, no overlap of predictions for different integer values occurs in a perfect SVR model. Figure 6.7 illustrates this behavior. For LR, it is much more challenging to model all responses perfectly with a single generalization over the features. By introducing the ε relaxation, SVR becomes much more robust and, in the case of integer responses, very close to a ranking purpose.

6.5.1 Investigating Features' Impact

We further investigate how variations in the SVR model impact the result. Specifically, we investigate: (i) eliminating one group of features of the same source entity or relationship at a time; (ii) using feature selection criteria; and (iii) personalized against non-personalized feature sets.

Table 6.10 shows the performance when a single set of features, belonging to a certain entity or interaction, is removed. Review features are the ones that impact the most. Indeed, content explains a great part of helpfulness, even when it consists of simple textual statistics. Voter features are the second set that reduces the average performance the most, but such difference is not significant. In fact, removing voter features is better in some cases and worse in others, probably working well only for certain users. Similarity also does not significantly worsen the results.
This is evidence that the similarity measure used, related to the ratings given, does not represent users' common interests well. The remaining author and connection features impact the result in a similar way, and significantly. This elucidates the importance of the author's reputation and of the interaction on the trust network, the latter a personalized trait, albeit the metric decreases only a little on average.

We also compare SVR with its non-personalized version, which considers only review and author features and uses the reviews' mean vote as responses, as Table 6.10 shows. The performance decreases significantly, so personalization helps explain helpfulness, albeit the obtained gain is not high. For instance, the second best predictor, personalized LR, is significantly worse than non-personalized SVR (the p-value is

Figure 6.7: Ranking obtained from a perfect SVR model with ε < 0.5. For integer responses, absence of error in such a model implies absence of error in the ranking.

Table 6.10: SVR performance with one set of features removed at a time. A paired t-test is performed between each new version and the original, with all features, and p-values are shown. A down arrow represents significant worsening, and an equality sign, a statistical tie.

Feature Set          nDCG@5            p-value
All                  0.9362 ± 0.0132   −
All \ Review         0.9078 ± 0.0256   ↓ 0.0040
All \ Author         0.9356 ± 0.0130   ↓ 0.0253
All \ Reader         0.9117 ± 0.0433   = 0.1300
All \ Similarity     0.9364 ± 0.0130   = 0.0973
All \ Connection     0.9356 ± 0.0130   ↓ 0.0056
All \ Personalized   0.9319 ± 0.0053   ↓ 0.0271

Table 6.11: SVR performance for feature sets defined by the selection criteria InfoGain and ρs. A paired t-test is performed between SVR with all features and with the corresponding subset, and p-values are shown. An up arrow denotes significant improvement, and an equality sign a statistical tie.
Feature Set         nDCG@5            p-value
All                 0.9362 ± 0.0132   −
InfoGain Top 10     0.9373 ± 0.0130   ↑ 0.0009
ρs Top 10           0.9374 ± 0.0133   ↑ 0.0002
InfoGain Top 5      0.9174 ± 0.0304   = 0.0781
ρs Top 5            0.9374 ± 0.0131   ↑ 0.0002
ρs Top 4            0.9293 ± 0.0239   = 0.1655
ρs Top 3            0.9355 ± 0.0159   = 0.5554
ρs Top 2            0.8762 ± 0.1147   = 0.1829
ρs Top 1            0.9038 ± 0.1012   = 0.3793
ρs Top 10 \ 2nd     0.9376 ± 0.0133   ↑ 0.0004
ρs Top 5 \ 2nd      0.9374 ± 0.0131   ↑ 0.0007

0.0011). Thus, the personalization hypothesis is not strong for the considered features.

Table 6.11 shows that, surprisingly, removing features may improve the performance significantly, as shown by the results with the top 10 features selected by InfoGain and by ρs. We then proceed with further reduction, evaluating the top 5 of each. Now, InfoGain has a lower average performance, although not significantly, while ρs improves significantly. Evaluating the performance for the top 1 to top 4 features selected by ρs reveals that all of them are statistically tied with the original version. Top 1 presents a great reduction on average, but it is not statistically different under a paired t-test. This indicates that, in some contexts, the size of the review alone is enough for predicting its quality. Indeed, the higher the number of tokens, the higher the probability of the user finding the information he seeks. However, the top 2 performed worse than the top 1, probably due to multicollinearity between the numbers of tokens and of sentences [Jain, 1991]. Thus, we investigate the two best subsets of features, ρs top 5 and top 10, without the second best feature. Such removal is denoted by \ 2nd, and ρs Top 10 \ 2nd results in the overall best subset.

It is important to note that the ρs Top 5 set, which is better than including all features, contains only one personalized feature: the voter's average helpfulness. Thus, among all personalized features, only the user bias has a significant impact.
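The ρs-based selection can be sketched as below; `top_k_features` and the toy feature columns are illustrative assumptions, with ρs computed as the Pearson correlation of fractional ranks:

```python
def ranks(values):
    # Fractional ranks: tied values receive the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman's rho_s: Pearson correlation computed on the ranks.
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def top_k_features(feature_columns, response, k):
    # Keep the k features whose |rho_s| with the response is largest.
    return sorted(feature_columns, reverse=True,
                  key=lambda f: abs(spearman(feature_columns[f], response)))[:k]

# Toy data: review length tracks helpfulness, the other column is noise.
features = {"num_tokens": [100, 500, 900, 1300], "noise": [3, 1, 4, 1]}
helpfulness = [1, 2, 4, 5]
print(top_k_features(features, helpfulness, k=1))  # → ['num_tokens']
```

Because ρs depends only on ranks, a feature that grows monotonically with helpfulness, such as the number of tokens above, scores a perfect correlation regardless of scale.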
However, as personalization was suggested to be a good approach by other analyses, as presented in Section 4.5, new features are yet to be evaluated to fully understand its role. Moreover, such a hypothesis may be important for a class of users, for instance, active ones.

6.5.2 Contrasting SVR Versions

Given the limited contribution of personalization using the existing features and the possibility of leveraging the ranking goal, we investigate extensions of SVR, attempting to obtain an improved version for the review recommendation task. Table 6.13 shows the performance of each version and the p-value of a paired t-test evaluating its difference from the original SVR. The versions are:

• Sel-SVR: version using the best subset of features found in Section 6.5.1, which corresponds to ρs Top 10 \ 2nd.

• Sel+New-SVR: version using the best subset of features along with the new interaction features proposed in Table 6.12.

• Sel+New-CoRaSVR: stochastic gradient descent version of SVR with joint ranking optimization and adaptations for focusing on the top of the ranking.

As the considered personalized features do not provide a strong benefit for this problem, we have investigated how the addition of simple interaction features, which are all personalized, impacts the result. By inserting 9 new features, we observed a significant improvement in SVR performance (Table 6.13). This indicates the great potential in exploring new features.

Furthermore, a joint optimization is evaluated by combining the SVR loss, which is ε-insensitive, and a cross-entropy loss on the relative position probability of pairs [Balakrishnan and Chopra, 2012], composing a Combined Ranking and SVR (CoRaSVR).

Table 6.12: New proposed features derived from the voter-author interaction.

Feature              Description
c_vot_trust          Boolean variable: 1 if the voter trusts the author, 0 otherwise.
c_aut_trust          Boolean variable: 1 if the author trusts the voter, 0 otherwise.
c_close_from_vot     Inverse of the shortest path length from voter to author; 0 if no path exists.
c_close_from_aut     Inverse of the shortest path length from author to voter; 0 if no path exists.
s_diff_trustors      Number of the author's trustors minus the voter's.
s_diff_num_reviews   Number of the author's written reviews minus the voter's.
s_diff_pagerank      Author's PageRank score minus the voter's.
s_diff_close         Author's closeness centrality minus the voter's.
s_diff_eigen         Author's eigenvector centrality score minus the voter's.

Table 6.13: Variations upon SVR. An up arrow denotes significant improvement.

Technique          nDCG@5            p-value
SVR                0.9362 ± 0.0132   −
Sel-SVR            0.9376 ± 0.0133   ↑ 0.0004
Sel+New-SVR        0.9381 ± 0.0132   ↑ 0.0001
Sel+New-CoRaSVR    0.9386 ± 0.0127   ↑ 0.0007

On the one hand, SVR is improved by accounting for relative positioning through the cross-entropy loss. On the other hand, the sparseness faced by pairwise learning to rank, due to the many tied pairs, is overcome by the regression support. Analyzing these results, it is remarkable that we observe statistical significance for such small absolute improvements. In fact, a small change in average nDCG@5 may imply an improvement in several rankings. Also, this is not uncommon for a ranking task, as also reported in the Yahoo Learning to Rank Challenge [Chapelle and Chang, 2011]: regarding ERR, the difference between the baseline, GBRT, and the winner was 0.00660, and between the first and second winners, 0.00075.

Overall, feature selection, the addition of new meaningful interaction features, and jointly optimizing a ranking loss improve over plain SVR. Although such statistically significant improvements are small, they serve to show the unexplored potential in defining a better feature set and a more advantageous loss. We detail the best version, CoRaSVR, in the next section.

6.5.3 Detailing CoRaSVR

CoRaSVR, a new method proposed as an extension of SVR, performs a joint ranking optimization along with the SVR objective.
A ranking loss is applied to the predicted probability of one review being ranked above another. Let $r_i$ and $r_j$ be a pair of reviews and $r_i \succ r_j$ denote the event of $r_i$ being ranked above $r_j$. The probability is predicted by applying the logistic function to the difference of the feature vectors $x_{r_i}$ and $x_{r_j}$. Specifically, given a ranking $O_{u,p}$ related to a reader $u$ and a product $p$, the probability of $r_i \succ r_j : r_i, r_j \in R$ in such ranking is computed as [Balakrishnan and Chopra, 2012]:

g(p_{r_i} - p_{r_j}) = g(w^T (x_{r_i} - x_{r_j})) = \frac{1}{1 + e^{-w^T (x_{r_i} - x_{r_j})}}    (6.5)

The true values of such probabilities are either binary, indicating whether one review is ranked above the other, or obtained by scaling the difference of the relevance scores of both reviews into [0, 1] [Sculley, 2010]. Then, we apply the cross-entropy loss to this estimation of probabilities [Balakrishnan and Chopra, 2012], deriving a combination of a pairwise and a pointwise loss, as follows:

E_{CoRaSVR} = \sum_{r_i, r_j \in O_{u,p}} \gamma \big( y_{r_i} \max\{|p_{r_i} - y_{r_i}| - \varepsilon, 0\} + y_{r_j} \max\{|p_{r_j} - y_{r_j}| - \varepsilon, 0\} \big) + (1 - \gamma)\, y_{r_i,r_j} \big( \log(1 + e^{p_{r_i} - p_{r_j}}) - y_{r_i,r_j} (p_{r_i} - p_{r_j}) \big) + \beta \|w\|^2,    (6.6)

where $y_{r_k}$ is the true helpfulness (response) and $p_{r_k} = w^T x_{r_k}$ is the predicted helpfulness of review $r_k$ in ranking $O_{u,p}$; $\beta$ is the regularization factor; $\gamma$ regulates the trade-off between regression and ranking error [Sculley, 2010]; $\varepsilon$ is the SVR error tolerance; and $y_{r_i,r_j}$ is the ground truth for the probabilities. In short, the loss is defined over pairs and considers a pointwise error for each member and a pairwise loss for the pair. By simultaneously satisfying REG and LTR restrictions, CoRaSVR's most generic purpose is regression, so it belongs in the REG class. Sculley [2010] had already addressed a combined ranking and regression approach in a generic format.
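Equation (6.5) can be sketched directly; the weight vector and feature values below are hypothetical:

```python
import math

def pair_probability(w, x_i, x_j):
    # Eq. (6.5): logistic function of the difference of predicted scores
    # w^T x_i - w^T x_j for the two reviews.
    diff = sum(wk * (a - b) for wk, a, b in zip(w, x_i, x_j))
    return 1.0 / (1.0 + math.exp(-diff))

# Hypothetical weights and feature vectors for two competing reviews.
w, x_i, x_j = [0.5, -0.2], [2.0, 1.0], [1.0, 3.0]
p = pair_probability(w, x_i, x_j)
# Swapping the pair complements the probability: P(ri > rj) + P(rj > ri) = 1.
print(round(p, 3), round(p + pair_probability(w, x_j, x_i), 3))  # → 0.711 1.0
```

Because the logistic function is applied to a score difference, the model reduces to a single linear scorer $w^T x$ whose pairwise comparisons are automatically consistent.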
We remark specifics of CoRaSVR's loss:

• Scaling of probabilities' responses: Probability responses are computed as y_{ri,rj} = (2^{y_ri} − 2^{y_rj} + 31) / 62, which scales the individual responses exponentially and maps the difference into the range [0, 1]. Such individual exponential scaling helps distinguish reviews for a ranking task, as already explored for pointwise and pairwise approaches [Balakrishnan and Chopra, 2012]. However, we did not include exponential scaling in the regression loss, as it decreased performance on the validation set. Besides, we scale the difference of responses into [0, 1] because it provided better validation results than binary values.

• Weighting error by true values: We further adapted the loss by weighting each error according to the true value. This way, we prioritize the prediction of reviews with higher helpfulness values and focus on the top, which contains the reviews of main interest. Similarly, pairs with a higher gap in responses are emphasized in the cross-entropy loss.

• Ignoring zero helpfulness: We disregarded votes equal to zero for both the regression and the ranking error, due to better performance on the validation set. Such evaluations do not seem to follow the same pattern as the others, as noted in Section 4.6, and may mislead model fitting.

• Considering only pairs with a response equal to or greater than the second maximum in the ranking: As described by Balakrishnan and Chopra [2012], accounting only for responses equal to or greater than the second maximum in a ranking helps prioritize prediction at the top. When a pair is composed of two responses below the second maximum, typically neither is at the top of the ranking. Therefore, only pairs with at least one review among the highest helpfulness values, specifically at least equal to the second maximum in the corresponding ranking, are considered in the optimization.
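Assuming integer helpfulness scores in [0, 5] (so that 2^5 − 2^0 = 31), the response scaling and the second-maximum pair filter described above can be sketched as (hypothetical helper names, not the thesis code):

```python
def scaled_pair_response(y_i, y_j, y_max=5):
    """Exponentially scaled pair response mapped into [0, 1]:
    (2^y_i - 2^y_j + 31) / 62 when y_max = 5."""
    span = 2 ** y_max - 1  # 31 for scores in [0, 5]
    return (2 ** y_i - 2 ** y_j + span) / (2 * span)

def eligible_pairs(responses):
    """Keep only ordered pairs where at least one response is >= the
    second-highest helpfulness in the ranking (top-focused filter)."""
    second_max = sorted(responses, reverse=True)[1]
    n = len(responses)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and max(responses[i], responses[j]) >= second_max]
```

Note that equal responses map to exactly 0.5, and the extreme score gap (5 vs. 0) maps to 1.0, so the scaled target stays a valid probability.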
Furthermore, we apply stochastic gradient descent to learn the model, using the following derivative of the error on a single instance ri, rj:

∂_rk = γ y_rk x_rk,     if p_rk − y_rk > ε
∂_rk = −γ y_rk x_rk,    if p_rk − y_rk < −ε
∂_rk = 0,               if |p_rk − y_rk| ≤ ε

∂_{ri,rj} = (1 − γ) y_{ri,rj} ( g(p_ri − p_rj) − y_{ri,rj} ) (x_ri − x_rj),

∂E^{ri,rj}_CoRaSVR / ∂w = ∂_ri + ∂_rj + ∂_{ri,rj} + βw,    (6.7)

where the derivation is presented in Appendix B.4. The derivative at |p_rk − y_rk| = ε is undefined, but we take it to be zero. Then, the stochastic updates are performed in the same way as described for BETF in Section 5.6. The optimal parameter values are computed on validation and are presented in Table 6.14. A higher weight is given to the regression loss by parameter γ, thus showing its contribution in overcoming pairwise sparseness. Note that the number of iterations refers to each single-instance update (as opposed to BETF and MF, where one iteration goes through all the data), as we sample the existing pairs due to their high cardinality. We do not perform early stopping under a convergence tolerance because computing the current loss over all pairs is too expensive, since it is very costly to enumerate them all (they reach 2,241,239 pairs in one split).

Table 6.14: List of parameters with a description, evaluated range or set, and the best value for ranking (nDCG@5*) for CoRaSVR.

Symbol   Description                        Range/Set        nDCG@5*
I        # Iterations                       (10^5, 10^7)     10^7
α        Initial learning rate              (10^−4, 1)       10^−2
β        Regularization factor              (10^−3, 10)      10^−2
γ        Ranking-regression trade-off       (0, 1)           0.8
ε        Regression tolerance on error      (10^−2, 1)       10^−1
p        Exponent of learning rate decay    (0.15, 0.35)     0.2
bias     Addition of bias                   {true, false}    false

6.6 Concluding Remarks

By performing a unified experimental evaluation with techniques from five different paradigms, we assess their performance and derive insights about key properties relevant to the review recommendation problem with a ranking purpose.
We remark that:

(i) Regression is significantly different from ranking; thus the latter should be prioritized for being aligned with the final purpose;

(ii) A chronological split between training and test sets is not the same as a random one, and the former should be the choice as it reflects a practical imposition;

(iii) Simpler models generalize better from the training set to test prediction, as represented by linear predictors, a small subset of features, high regularization, few iterations, among others;

(iv) Traditionally good approaches in recommendation, such as collaborative filtering and the addition of biases, are not good assumptions for a scenario with plenty of review cold-starts and a ranking goal;

(v) SVR performs well for ranking under integer relevance scores by introducing the ε relaxation;

(vi) SVR is significantly improved by selecting a subset of features through Spearman correlation (ρs), by adding new interaction features, and by jointly incorporating a ranking loss.

Chapter 7

Conclusion

Reviews are a source of information increasingly used and produced by consumers. There is a large volume of reviews available, which is a challenge for readers seeking useful information. A common solution is to rank the reviews according to their global helpfulness, neglecting that users have different preferences and thus perceive the helpfulness of a given review differently. Hence, recent works model the task as a personalized problem. In such context, we define review recommendation as a top-n task, borrowing concepts from recommender systems and following a recent trend of ranking-oriented modeling. We used a dataset from Ciao UK to perform our experiments, which has the great advantage of containing personalized votes. This dataset also has an underlying trust network as an important source of information for recommendation. We observed a great discrepancy among types of users and reviews in terms of the related information available.
This situation motivates recommendation as a way of finding unpopular reviews in the long tail, but also represents a challenge to precisely recommend for very different users. We also noted that authors tend to have a consistent helpfulness across reviews; moreover, such stability increases when considering reviews' votes, and even more for votes related to an author-reader pair, a personalized criterion. This motivates seeking personalized solutions as the next frontier for improvement. A unified experimental design was performed to compare a rich and heterogeneous set of personalized solutions for review recommendation, from five paradigms, that had not been previously compared against each other. Our experiments considered the best parametric configuration and the RMSE, nDCG@p and ERR metrics. We concluded that personalization explains a portion of the helpfulness value, with personalized SVR being significantly the best approach, at 95% confidence. Moreover, the simplicity of the model is advantageous, possibly suffering less from overfitting and producing a more accurate generic model. Including observed features in the model is extremely important due to the sparseness of the problem, which has many cold-starts. Despite aiming at a ranking goal, pairwise and listwise learning to rank predictors suffer considerably from sparseness due to ties. Our evaluation indicated that a regression metric (e.g., RMSE) leads to significantly different conclusions compared to a ranking one (e.g., nDCG@p). However, the latter is more suitable as it better represents the final purpose of presenting a sorted list to a user. Finally, including new interaction features and jointly optimizing a ranking goal significantly improve SVR performance.

7.1 Contributions

The main contributions of our work are summarized as follows.

• Extensive comparison of techniques from multiple paradigms for review recommendation using a real dataset.
We compared a rich set of techniques, from different paradigms, for review recommendation in a real dataset from Ciao UK. Before diving into techniques and experiments, we investigated the dataset and its helpfulness voting dynamics, demonstrating, for instance, that personalization seems to be the right hypothesis for helpfulness prediction. Regarding the techniques evaluated, there is no previous work, to the best of our knowledge, that considers experiments with personalized LR, SVR, GBRT, RSVM, LMART and RLFM for such task. All considered solutions were compared in a unified and statistical experimental design, with parameter tuning. In particular, we evaluated specialized approaches for review recommendation, CAP and BETF, not yet compared. We studied such approaches, providing precise algorithms and formulas, which are not available in the original works. Notably, we identified that CAP is an extension of RLFM and derived its formulas based on this finding in order to apply an MCEM algorithm. We have also implemented such approaches, with publicly available code. Overall, our comparison provided several insights about an effective ranking-oriented review recommender system. Specifically, good performers typically share the following aspects: (i) including observed features; (ii) building simpler models; and (iii) personalized prediction.

• Assessment of review recommendation solutions under a ranking perspective, besides a regression one. A recent trend consists in considering the recommendation problem as a ranking task. Indeed, the great majority of solutions predict a real value for the rating, but only to rank the results and present the top ones to the user. By considering the problem as a top-n task directly, the over-restriction of regression is overcome and results tend to improve.
Although learning to rank techniques were not the best performers in our comparison, we tuned and found the most suitable solution by using a ranking measure. This perspective yields the best final result presented to the user.

• Exploitation of different versions of the best method – SVR. We explored variations of SVR by applying feature selection, incorporating new features and simultaneously optimizing a ranking loss. We showed that a set of only 5 out of 43 features is enough to provide an effective predictor. Content features, derived from text, are the ones that impact the most. Furthermore, we defined new features to investigate the potential for improvement in personalization, obtaining significantly better results. We also defined a combined ranking and SVR method – CoRaSVR, which has an optimization objective composed of a ranking and a regression loss, yielding significant improvement.

7.2 Future Work

Future work includes exploring new observed features, especially personalized ones related to review-user interaction, and taking more advantage of the ranking purpose. Additionally, considering the few limitations of the Ciao UK dataset, experimenting with a new one would be very advantageous. The model could also take into account the aging of reviews and other time-related patterns that impact helpfulness. Also, the model could specialize in different types of users. Finally, other metrics may be evaluated to better represent the user's interest. We detail each future direction as follows.

• Assessment using other datasets. We only consider one dataset, as no other platform, to the best of our knowledge, has publicly available votes discriminated by each user. If another dataset with such information becomes available, evaluating under this new scenario will help understand the extent of the patterns observed in this work. Another possibility consists in deriving helpfulness opinions implicitly.

• Further enhancements upon SVR.
We investigated new predictor versions built upon SVR by including new features and a ranking loss. Yet, many other features may be explored. For instance, more semantic features may be extracted from text (e.g., topics) and be used to relate the reader with the author or review in deeper aspects, such as ways of thinking and opinions. Besides, several other ranking losses may be composed, for instance, a listwise one.

• Time dynamics included in the model. An old review is likely not to be as relevant as a new one, but that may depend on the product and category. Also, there may be underlying timing effects in users' behavior. Exploring such a time-aware model may provide effectiveness improvements.

• Applying specific models for different types of users. Personalization was not a strong hypothesis in our evaluation, as non-personalized SVR was statistically worse than SVR, but better than the second best. Therefore, an investigation of how important personalization is for different groups of users is the next step. Indeed, we show that users differ deeply in participatory behavior. For instance, the hypothesis that personalization works well only for active users may be verified.

• Evaluation of other metrics. When providing a set with the most helpful reviews to a user, other characteristics may be important, such as diversity. The more distinct the features and opinions described in the reviews, the wider the knowledge a user would obtain, thus allowing a better decision. In general terms, other means of evaluating the results may be investigated to better account for user needs.

Bibliography

Adomavicius, G. and Tuzhilin, A. (2005). Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. on Knowl. and Data Eng., 17(6):734–749.
Agarwal, D. and Chen, B.-C. (2009). Regression-based latent factor models. In Procs. of KDD, Paris, France.
Amatriain, X., Pujol, J.
M., and Oliver, N. (2009). I like it... I like it not: Evaluating user ratings noise in recommender systems. In Procs. of UMAP, Trento, Italy.
Anderson, M. (2014). Local consumer review survey 2014. https://www.brightlocal.com/2014/07/01/local-consumer-review-survey-2014.
Anderson, M. L. and Magruder, J. R. (2012). Learning from the crowd: Regression discontinuity estimates of the effects of an online review database. Economic Journal, 122(563):957–989.
Bakhshi, S., Kanuparthy, P., and Gilbert, E. (2014). Demographics, weather and online reviews: A study of restaurant recommendations. In Procs. of WWW, Seoul, Korea.
Balakrishnan, S. and Chopra, S. (2012). Collaborative ranking. In Procs. of WSDM, Seattle, WA, USA.
Balog, K. and Ramampiaro, H. (2013). Cumulative citation recommendation: Classification vs. ranking. In Procs. of SIGIR, Dublin, Ireland.
Bidart, R., Pereira, A., Almeida, J., and Lacerda, A. (2014). Where should I go? City recommendation based on user communities. In Procs. of LA-WEB, Ouro Preto, Minas Gerais, Brazil.
Bjørkelund, E., Burnett, T. H., and Nørvåg, K. (2012). A study of opinion mining and visualization of hotel reviews. In Procs. of iiWAS, Bali, Indonesia.
Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. JRSS: B, 61(1):265–285.
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. (2005). Learning to rank using gradient descent. In Procs. of ICML, Bonn, Germany.
Burges, C. J. (2010). From RankNet to LambdaRank to LambdaMART: An overview. Technical report MSR-TR-2010-82, Microsoft Research.
Canuto, S. D. et al. (2013). A comparative study of learning-to-rank techniques for tag recommendation. JIDM, 4(3):453–468.
Chapelle, O. and Chang, Y. (2011). Yahoo! learning to rank challenge overview. In Procs. of the Yahoo! Learning to Rank Challenge - ICML, Bellevue, WA, USA.
Chapelle, O.
and Keerthi, S. S. (2010). Efficient algorithms for ranking with SVMs. IR, 13(3):201–215.
Chapelle, O., Metlzer, D., Zhang, Y., and Grinspan, P. (2009). Expected reciprocal rank for graded relevance. In Procs. of CIKM, Hong Kong, China.
Christakopoulou, K. and Banerjee, A. (2015). Collaborative ranking with a push at the top. In Procs. of WWW, Florence, Italy.
Christodoulides, G. and Jevons, C. (2011). The voice of the consumer speaks forcefully in brand identity: User-generated content forces smart marketers to listen. Journal of Advertising Research, 51(1):101–112.
Cremonesi, P., Koren, Y., and Turrin, R. (2010). Performance of recommender algorithms on top-n recommendation tasks. In Procs. of RecSys, Barcelona, Spain.
D'Addio, R. and Garcia Manzato, M. (2014). A collaborative filtering approach based on user's reviews. In Procs. of BRACIS, pages 204–209.
Dahlgren, S., Johnson, A., and Liljenberg, C. (2015). Online reviews – what motivates you? A qualitative study of customers' motivation to write online reviews. Bachelor Thesis – Linnaeus University, Sweden.
Danescu-Niculescu-Mizil, C., Kossinets, G., Kleinberg, J., and Lee, L. (2009). How opinions are received by online communities: A case study on amazon.com helpfulness votes. In Procs. of WWW, Madrid, Spain.
Dellarocas, C. and Narayan, R. (2006). What motivates consumers to review a product online? A study of the product-specific antecedents of online movie reviews. In Procs. of WISE, Evanston, IL, USA.
Deshpande, M. and Karypis, G. (2004). Item-based top-n recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143–177.
Do, C. B. and Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature Biotechnology, 26(8):897–900.
Duan, H., Zhai, C., Cheng, J., and Gattani, A. (2013). Supporting keyword search in product database: A probabilistic approach. PVLDB, 6(14):1786–1797.
Easley, D. and Kleinberg, J. (2010).
Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press.
Esuli, A. and Sebastiani, F. (2006). SentiWordNet: A publicly available lexical resource for opinion mining. In Procs. of LREC, Genoa, Italy.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.
Fu, X. and Asorey, H. (2015). Data-driven product innovation. In Procs. of KDD, Sydney, NSW, Australia.
Ganesan, K. and Zhai, C. (2012). Opinion-based entity ranking. Inf. Retr., 15(2):116–150.
Hennig-Thurau, T., Gwinner, K. P., Walsh, G., and Gremler, D. D. (2004). Electronic word-of-mouth via consumer-opinion platforms: What motivates consumers to articulate themselves on the internet? Journal of Interactive Marketing, 18(1):38–52.
Hu, M. and Liu, B. (2004). Mining and summarizing customer reviews. In Procs. of KDD, Seattle, WA, USA.
Hu, Y., Koren, Y., and Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. In Procs. of ICDM, Pisa, Italy.
Hyndman, R. J. and Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4):679–688.
Jain, R. (1991). The Art of Computer Systems Performance Analysis – Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley.
Joachims, T. (2006). Training linear SVMs in linear time. In Procs. of KDD, Philadelphia, PA, USA.
Jurafsky, D., Chahuneau, V., Routledge, B., and Smith, N. (2014). Narrative framing of consumer sentiment in online restaurant reviews. First Monday, 19(4).
Kim, S.-M. et al. (2006). Automatically assessing review helpfulness. In Procs. of EMNLP, Sydney, Australia.
Koren, Y. (2008). Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Procs. of KDD, Las Vegas, NV, USA.
Koren, Y., Bell, R., and Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Commun.
ACM, 42(8):30–37.
Lee, S. and Choeh, J. Y. (2014). Predicting the helpfulness of online reviews using multilayer perceptron neural networks. Expert Syst. Appl., 41(6):3041–3046.
Li, H. (2011). A short introduction to learning to rank. IEICE Transactions, 94-D(10):1854–1862.
Liben-Nowell, D. and Kleinberg, J. (2007). The link-prediction problem for social networks. JASIST, 58(7):1019–1031.
Liu, Y., Huang, X., An, A., and Yu, X. (2008). Modeling and predicting the helpfulness of online reviews. In Procs. of ICDM, Miami, FL, USA.
Lu, W., Chen, S., Li, K., and Lakshmanan, L. V. S. (2014). Show me the money: Dynamic recommendations for revenue maximization. PVLDB, 7(14):1785–1796.
Lu, Y. et al. (2010). Exploiting social context for review quality prediction. In Procs. of WWW, Raleigh, NC, USA.
Lucchese, C. et al. (2015). QuickScorer: A fast algorithm to rank documents with additive ensembles of regression trees. In Procs. of SIGIR, Santiago, Chile.
Machanavajjhala, A., Korolova, A., and Sarma, A. D. (2011). Personalized social recommendations: Accurate or private. PVLDB, 4(7):440–450.
Martin, L. and Pu, P. (2014). Prediction of helpful reviews using emotions extraction. In Procs. of AAAI, Québec City, Canada.
McAuley, J. and Leskovec, J. (2013). Hidden factors and hidden topics: Understanding rating dimensions with review text. In Procs. of RecSys, Hong Kong, China.
Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, Pickett, J. P., Holberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M. A., and Aiden, E. L. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182.
Miller, G. A. (1995). WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.
Moghaddam, S., Jamali, M., and Ester, M. (2012). ETF: Extended Tensor Factorization model for personalizing prediction of review helpfulness. In Procs. of WSDM, Seattle, WA, USA.
Murphy, K. P. (2012).
Machine Learning: A Probabilistic Perspective. The MIT Press.
Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. In Procs. of ICML, Banff, Alberta, Canada.
O'Mahony, M. P. and Smyth, B. (2009). Learning to recommend helpful hotel reviews. In Procs. of RecSys, New York, NY, USA.
Parra, D., Karatzoglou, A., Yavuz, I., and Amatriain, X. (2011). Implicit feedback recommendation via implicit-to-explicit ordinal logistic regression mapping. In Procs. of CARS.
Pedregosa, F. et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.
Powers, D. M. W. (2007). Evaluation: From precision, recall and F-factor to ROC, informedness, markedness & correlation. Technical report SIE-07-001, School of Informatics and Engineering, Flinders University, Adelaide, Australia.
Rensink, J. (2013). What motivates people to write online reviews and which role does personality play? A study providing insights in the influence of seven motivations on the involvement to write positive and negative online reviews and how five personality traits play a role. Master Thesis – University of Twente, Netherlands.
Ricci, F. et al. (2010). Recommender Systems Handbook. Springer-Verlag, 1st edition.
Sculley, D. (2010). Combined regression and ranking. In Procs. of KDD, Washington, DC, USA.
Shi, Y., Karatzoglou, A., Baltrunas, L., Larson, M., and Hanjalic, A. (2013). xCLiMF: Optimizing expected reciprocal rank for data with multiple levels of relevance. In Procs. of RecSys, Hong Kong, China.
Shi, Y., Karatzoglou, A., Baltrunas, L., Larson, M., Oliver, N., and Hanjalic, A. (2012). CLiMF: Learning to maximize reciprocal rank with collaborative less-is-more filtering. In Procs. of RecSys '12, pages 139–146. ACM.
Shi, Y., Larson, M., and Hanjalic, A. (2014).
Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Comput. Surv., 47(1):3:1–3:45.
Simonson, I. and Rosen, E. (2014). What marketers misunderstand about online reviews. Harvard Business Review.
Sipos, R., Ghosh, A., and Joachims, T. (2014). Was this review helpful to you?: It depends! Context and voting patterns in online content. In Procs. of WWW, Seoul, Korea.
Smola, A. J. and Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3):199–222.
Tang, J. et al. (2013). Context-aware review helpfulness rating prediction. In Procs. of RecSys, Hong Kong, China.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag.
Vasconcelos, M., Almeida, J. M., and Gonçalves, M. A. (2015). Predicting the popularity of micro-reviews: A Foursquare case study. Information Sciences, 325:355–374.
Wang, J., Zhao, J., Guo, S., North, C., and Ramakrishnan, N. (2014). ReCloud: Semantics-based word cloud visualization of user reviews. In Procs. of GI, Montreal, Québec, Canada.
Weimer, M., Karatzoglou, A., Le, Q. V., and Smola, A. J. (2007). COFI RANK – maximum margin matrix factorization for collaborative ranking. In Procs. of NIPS, Vancouver, British Columbia, Canada.
Xu, J., Zheng, X., and Ding, W. (2012). Personalized recommendation based on reviews and ratings alleviating the sparsity problem of collaborative filtering. In Procs. of ICEBE.
Yang, Y. and Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Procs. of ICML, Nashville, TN, USA.
Yin, H. et al. (2012). Challenging the long tail recommendation. PVLDB, 5(9):896–907.
Ypma, T. J. (1995). Historical development of the Newton-Raphson method. SIAM Rev., 37(4):531–551.
Yue, Y., Finley, T., Radlinski, F., and Joachims, T. (2007). A support vector method for optimizing average precision. In Procs. of SIGIR, Amsterdam, The Netherlands.
Zhan, J., Loh, H.
T., and Liu, Y. (2009). Gather customer concerns from online product reviews – a text summarization approach. Expert Syst. Appl., 36(2):2107–2115.
Zhang, R. and Tran, T. (2011). An information gain-based approach for recommending useful product reviews. Knowl. Inf. Syst., 26(3):419–434.

Appendix A

Assumptions about Features

For r_comp_ratio, we consider comparative adverbs and adjectives. r_uni_ratio is computed as the number of unique words divided by the total number of word occurrences. r_avg_sent uses sentence length measured in tokens. r_cap_ratio assumes the capitalization of the first letter of the sentence. Statistics of the difference between author and voter ratings (s_diff_avg_rat, s_diff_max_rat, s_diff_min_rat) are calculated as the author's value minus the voter's. The Adamic-Adar features (c_adam_trustees, c_adam_trustors) consider neighbors in the undirected version of the trust graph. c_katz is calculated considering paths from voter to author in the directed graph, using β = 0.005, a standard value [Liben-Nowell and Kleinberg, 2007]. Feature r_kl is evaluated removing stopwords and considering all reviews of the product in the training set and, additionally, the review being modeled, when in the test set. r_pos_ratio and r_neg_ratio are the ratios of words whose positive score in the first synset of SentiWordNet [Esuli and Sebastiani, 2006] is higher than the negative one, and vice versa, respectively; positive and negative scores are exchanged if there is a preceding negative word in the same sentence. Foreign words were identified by their absence from WordNet [Miller, 1995].

Appendix B

Mathematical Proofs

B.1 Complete Log-Likelihood for CAP

In this section, we derive the complete log-likelihood for the CAP system. For this calculation, we use the formulas for the normal distribution (N) and the multivariate normal distribution (MVN) defined below.
P_N(x, μ, σ) = (1 / (σ√(2π))) e^{−(x−μ)² / (2σ²)}    (B.1)

P_MVN(x, μ, Σ) = (1 / √((2π)^k det Σ)) e^{−(1/2)(x−μ)^T Σ^{−1} (x−μ)}    (B.2)

Using these probability densities, the log-likelihood is given by the joint probability of observing the data (y_h_ij) and the latent variables Ω. This is equivalent to the factorization over the probabilistic graphical model presented in Figure C.2. Thus,

L(Ω,Θ) = ∏_{u_i,r_j ∈ V₀×R₀} P(y_h_ij | Ω) · ∏_{u_i ∈ V₀} P(α_i) · ∏_{r_j ∈ R₀} P(β_j) · ∏_{a_k ∈ A₀} P(ξ_k) · ∏_{u_i,a_k ∈ S₀} P(γ_i^k) · ∏_{u_i,a_k ∈ C₀} P(λ_i^k) · ∏_{u_i ∈ V₀} P(u_i) · ∏_{r_j ∈ R₀} P(v_j)    (B.3)

All latent variables are considered independent of each other. Only y_h_ij depends on all variables in Ω, through the mean of its distribution, p_h_ij. Then, we replace the probabilities using the probability density formulas. From now on, we omit specific elements and sets in summations by using only the indices i for u_i, j for r_j and k for a_k.

L(Ω,Θ) = ∏_{i,j} (1/(σ_h√(2π))) e^{−(y_h_ij − p_h_ij)² / (2σ_h²)}
· ∏_i (1/(σ_α√(2π))) e^{−(α_i − d^T x_{u_i})² / (2σ_α²)}
· ∏_j (1/(σ_β√(2π))) e^{−(β_j − g^T x_{r_j})² / (2σ_β²)}
· ∏_k (1/(σ_ξ√(2π))) e^{−(ξ_k − b^T x_{a_k})² / (2σ_ξ²)}
· ∏_{i,k} (1/(σ_γ√(2π))) e^{−(γ_i^k − g(r^T x_{s_ik}))² / (2σ_γ²)}
· ∏_{i,k} (1/(σ_λ√(2π))) e^{−(λ_i^k − g(h^T x_{c_ik}))² / (2σ_λ²)}
· ∏_i (1/√((2π)^K det A_u)) e^{−(1/2)(u_i − W^T x_{u_i})^T A_u^{−1} (u_i − W^T x_{u_i})}
· ∏_j (1/√((2π)^K det A_v)) e^{−(1/2)(v_j − V^T x_{r_j})^T A_v^{−1} (v_j − V^T x_{r_j})}    (B.4)

To obtain the log-likelihood ℓ(Ω,Θ), we take the natural logarithm of this value, using the property that the logarithm of a product of terms is the sum of the logarithms of each term.

ℓ(Ω,Θ) = Σ_{i,j} ( log(1/(σ_h√(2π))) − (y_h_ij − p_h_ij)² / (2σ_h²) )
+ Σ_i ( log(1/(σ_α√(2π))) − (α_i − d^T x_{u_i})² / (2σ_α²) )
+ Σ_j ( log(1/(σ_β√(2π))) − (β_j − g^T x_{r_j})² / (2σ_β²) )
+ Σ_k ( log(1/(σ_ξ√(2π))) − (ξ_k − b^T x_{a_k})² / (2σ_ξ²) )
+ Σ_{i,k} ( log(1/(σ_γ√(2π))) − (γ_i^k − g(r^T x_{s_ik}))² / (2σ_γ²) )
+ Σ_{i,k} ( log(1/(σ_λ√(2π))) − (λ_i^k − g(h^T x_{c_ik}))² / (2σ_λ²) )
+ Σ_i ( log(1/√((2π)^K det A_u)) − (1/2)(u_i − W^T x_{u_i})^T A_u^{−1} (u_i − W^T x_{u_i}) )
+ Σ_j ( log(1/√((2π)^K det A_v)) − (1/2)(v_j − V^T x_{r_j})^T A_v^{−1} (v_j − V^T x_{r_j}) )    (B.5)

Simplifying further,
ℓ(Ω,Θ) = − Σ_{i,j} ( log σ_h + (1/2) log 2π + (y_h_ij − p_h_ij)² / (2σ_h²) )
− Σ_i ( log σ_α + (1/2) log 2π + (α_i − d^T x_{u_i})² / (2σ_α²) )
− Σ_j ( log σ_β + (1/2) log 2π + (β_j − g^T x_{r_j})² / (2σ_β²) )
− Σ_k ( log σ_ξ + (1/2) log 2π + (ξ_k − b^T x_{a_k})² / (2σ_ξ²) )
− Σ_{i,k} ( log σ_γ + (1/2) log 2π + (γ_i^k − g(r^T x_{s_ik}))² / (2σ_γ²) )
− Σ_{i,k: k∈T_i} ( log σ_λ + (1/2) log 2π + (λ_i^k − g(h^T x_{c_ik}))² / (2σ_λ²) )
− Σ_i ( (K/2) log 2π + (1/2) log det A_u + (1/2)(u_i − W^T x_{u_i})^T A_u^{−1} (u_i − W^T x_{u_i}) )
− Σ_j ( (K/2) log 2π + (1/2) log det A_v + (1/2)(v_j − V^T x_{r_j})^T A_v^{−1} (v_j − V^T x_{r_j}) )    (B.6)

Collapsing the terms that are constant across EM iterations, whose sum is represented by C, and replacing standard deviations by variances, we obtain:

ℓ(Ω,Θ) = C − Σ_{i,j} (1/2)( log σ_h² + (y_h_ij − p_h_ij)² / σ_h² )
− Σ_i (1/2)( log σ_α² + (α_i − d^T x_{u_i})² / σ_α² )
− Σ_j (1/2)( log σ_β² + (β_j − g^T x_{r_j})² / σ_β² )
− Σ_k (1/2)( log σ_ξ² + (ξ_k − b^T x_{a_k})² / σ_ξ² )
− Σ_{i,k} (1/2)( log σ_γ² + (γ_i^k − g(r^T x_{s_ik}))² / σ_γ² )
− Σ_{i,k} (1/2)( log σ_λ² + (λ_i^k − g(h^T x_{c_ik}))² / σ_λ² )
− Σ_i (1/2)( log det A_u + (u_i − W^T x_{u_i})^T A_u^{−1} (u_i − W^T x_{u_i}) )
− Σ_j (1/2)( log det A_v + (v_j − V^T x_{r_j})^T A_v^{−1} (v_j − V^T x_{r_j}) )    (B.7)

We can expand p_h_ij in terms of the latent variables, obtaining:

ℓ(Ω,Θ) = C
− (1/2) Σ_{i,j} ( (1/σ_h²)(y_h_ij − u_i^T v_j − α_i − β_j − ξ_k − δ₁(i,k)γ_i^k − δ₂(i,k)λ_i^k)² + log σ_h² )
− (1/2) Σ_i ( (α_i − d^T x_{u_i})² / σ_α² + log σ_α² + (u_i − W^T x_{u_i})^T A_u^{−1} (u_i − W^T x_{u_i}) + log det A_u )
− (1/2) Σ_j ( (β_j − g^T x_{r_j})² / σ_β² + log σ_β² + (v_j − V^T x_{r_j})^T A_v^{−1} (v_j − V^T x_{r_j}) + log det A_v )
− (1/2) Σ_k ( log σ_ξ² + (ξ_k − b^T x_{a_k})² / σ_ξ² )
− (1/2) Σ_{i,k} ( log σ_γ² + (γ_i^k − g(r^T x_{s_ik}))² / σ_γ² )
− (1/2) Σ_{i,k: k∈T_i} ( log σ_λ² + (λ_i^k − g(h^T x_{c_ik}))² / σ_λ² )    (B.8)

We can observe that this is a modification of the formula presented for Regression-based Latent Factor Models [Agarwal and Chen, 2009], incorporating new latent variables and dropping the linear regression term on dyadic features (x_ij^T b).
Using the definition of E[φ|Rest] = φ̂ such that ℓ′(φ̂) = 0, and V[φ|Rest] = ℓ″(φ̂)^{−1} [Tang et al., 2013], where φ is any variable, we have estimates of the mean and variance, thus the full conditional probabilities, to be used in Gibbs Sampling.

B.2 Derivatives of the Expectation of the Log-Likelihood for CAP

We do not have a closed formula for the joint distribution of the variables due to the product u_i^T v_j; that is, we do not know the distribution of a product of MVNs. Hence, this distribution is approximated through Gibbs Sampling. An empirical expectation (E*) can be taken from this approximated distribution [Booth and Hobert, 1999], considering a sample of L elements:

E[X] = Σ_x x p(x),    E*[X] = (1/L) Σ_{i=1}^{L} x_i,    (B.9)

where x_i is the value of the random variable X in the i-th sample. We want the expected value of a function. Thus,

E[f(X)] = Σ_x f(x) p(x),    E*[f(X)] = (1/L) Σ_{i=1}^{L} f(x_i)

After the Gibbs Sampling step, we have L samples of the whole latent vector, {Ω} = {Ω^(1), Ω^(2), ..., Ω^(L)}. Then, we may compute the probabilities for each occurred value or simply average the function f evaluated over each sample to obtain the empirical expectation of ℓ:

E* = Σ_{Ω} p(Ω | y_ij, Θ) ℓ(Ω,Θ) = (1/L) Σ_{l=1}^{L} ℓ(Ω_l, Θ),    (B.10)

where {Ω} is the set of all possible full latent vectors and Ω_l is the value of the latent factor vector in the l-th sample. In this case, the log-likelihood is a function of the latent variables, but, after sampling, we marginalize these values out through the summation. We know how to calculate ℓ(Ω,Θ) because we have the values of the latent variables in each sample, as well as the true value of the helpfulness (y_h_ij). Then, after sampling, the result E* is a function of the parameters Θ only. Since the value of E* is known, it is possible to differentiate it. As it is a summation, we can use the property that the derivative of a sum is the sum of the derivatives.
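The empirical expectation E* over Gibbs samples is just a sample mean; a minimal sketch (the variable names and the toy Gaussian check are ours, not the thesis code):

```python
import random

def empirical_expectation(f, samples):
    """E*[f(X)]: average of f over the L samples drawn by Gibbs Sampling."""
    return sum(f(x) for x in samples) / len(samples)

# Toy check against a known expectation: E[X^2] = 1 for X ~ N(0, 1).
random.seed(0)
draws = [random.gauss(0.0, 1.0) for _ in range(100_000)]
approx = empirical_expectation(lambda x: x * x, draws)
```

With enough samples, the empirical mean converges to the true expectation, which is exactly why the E-step can be replaced by this Monte Carlo average.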
The formulas for all parameters can be easily found and set to zero in order to find an optimal value, which reduces to OLS linear regression on the empirical mean of each variable. However, the parameters $\mathbf{r}$ and $\mathbf{h}$ are an exception, as we have a non-linear regression to solve due to the sigmoid transformation $g$. Instead, the Newton-Raphson method can be used, which is a method for finding a root of a function. Since we want to maximize $E^*$, we find the root of its derivative through this method. Consequently, we have to calculate the first and second partial derivatives of $E^*$ with respect to $\mathbf{r}$ and $\mathbf{h}$: the former is the function whose root is going to be estimated, and the latter is its derivative. Only the terms of $\ell(\Omega,\Theta)$ containing $\mathbf{r}$ are relevant and, first, we consider a single dimension $r_j$.

$$
\begin{aligned}
\frac{\partial E^*}{\partial r_j} &= \frac{\partial}{\partial r_j}\, \frac{1}{L}\sum_{l=1}^{L} -\frac{1}{2} \sum_{i,k} \frac{(\gamma^{kl}_i - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}))^2}{\sigma_\gamma^2} \\
&= \frac{1}{L}\sum_{l=1}^{L} -\frac{1}{2}\frac{2}{\sigma_\gamma^2} \sum_{i,k} (\gamma^{kl}_i - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}))(-1)\, g'(\mathbf{r}^T \mathbf{x}_{s_{ik}})\, p^k_{ij} \\
&= \frac{1}{\sigma_\gamma^2}\, \frac{1}{L}\sum_{l=1}^{L} \sum_{i,k} (\gamma^{kl}_i - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}))\, g'(\mathbf{r}^T \mathbf{x}_{s_{ik}})\, p^k_{ij} \\
&= \frac{1}{\sigma_\gamma^2} \sum_{i,k} \Big( \frac{1}{L}\sum_{l=1}^{L} \gamma^{kl}_i - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}) \Big)\, g'(\mathbf{r}^T \mathbf{x}_{s_{ik}})\, p^k_{ij} \\
&= \frac{1}{\sigma_\gamma^2} \sum_{i,k} \big( E^*(\gamma^k_i) - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}) \big)\, g'(\mathbf{r}^T \mathbf{x}_{s_{ik}})\, p^k_{ij},
\end{aligned} \tag{B.11}
$$

where $g'(\mathbf{r}^T \mathbf{x}_{s_{ik}})$ is $\frac{\partial g(\mathbf{r}^T \mathbf{x}_{s_{ik}})}{\partial (\mathbf{r}^T \mathbf{x}_{s_{ik}})}$ and $E^*(\gamma^k_i)$ is the empirical mean of the variable $\gamma^k_i$ over the $L$ samples. This solution is no different from a non-linear regression on the empirical mean of a variable by minimizing the sum of squares. In this calculation, we use the Chain Rule for composite functions. In this scenario,

$$
\frac{\partial (\gamma^{kl}_i - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}))^2}{\partial r_j} = \frac{\partial (\gamma^{kl}_i - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}))^2}{\partial (\gamma^{kl}_i - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}))} \cdot \frac{\partial (\gamma^{kl}_i - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}))}{\partial g(\mathbf{r}^T \mathbf{x}_{s_{ik}})} \cdot \frac{\partial g(\mathbf{r}^T \mathbf{x}_{s_{ik}})}{\partial (\mathbf{r}^T \mathbf{x}_{s_{ik}})} \cdot \frac{\partial (\mathbf{r}^T \mathbf{x}_{s_{ik}})}{\partial r_j} \tag{B.12}
$$

The vector derivative is just the extrapolation of this using the definition below.

$$\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = \Big[ \frac{\partial f(\mathbf{x})}{\partial x_1}\ \ \frac{\partial f(\mathbf{x})}{\partial x_2}\ \ldots\ \frac{\partial f(\mathbf{x})}{\partial x_n} \Big] \tag{B.13}$$

Then,

$$\frac{\partial E^*}{\partial \mathbf{r}} = \Big[ \frac{\partial E^*}{\partial r_1}\ \ \frac{\partial E^*}{\partial r_2}\ \ldots\ \frac{\partial E^*}{\partial r_n} \Big] = \frac{1}{\sigma_\gamma^2} \sum_{i,k} \big( E^*(\gamma^k_i) - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}) \big)\, g'(\mathbf{r}^T \mathbf{x}_{s_{ik}})\, \mathbf{p}^k_i \tag{B.14}$$

The term $\mathbf{x}_{s_{ik}}$ is a vector, which turns this result into a vector. The remaining terms are the same for all $r_j$, which makes this formula easily extensible to the vector case. The formula for $\mathbf{h}$ is similar.

In order to calculate the second-order derivative with respect to an index $j'$ of $\mathbf{r}$, we now have the derivative of a vector with respect to a scalar. This is defined as deriving each dimension $j$, which gives $\frac{\partial E^*}{\partial r_j}$. Considering the dimensions $j$ and $j'$,

$$
\begin{aligned}
\frac{\partial^2 E^*}{\partial r_j \partial r_{j'}} &= \frac{\partial}{\partial r_{j'}}\, \frac{1}{\sigma_\gamma^2} \sum_{i,k} \big( E^*(\gamma^k_i) - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}) \big)\, g'(\mathbf{r}^T \mathbf{x}_{s_{ik}})\, p^k_{ij} \\
&= \frac{1}{\sigma_\gamma^2} \sum_{i,k} \Big( \frac{\partial (E^*(\gamma^k_i) - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}))}{\partial r_{j'}}\, g'(\mathbf{r}^T \mathbf{x}_{s_{ik}}) + \big( E^*(\gamma^k_i) - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}) \big)\, \frac{\partial g'(\mathbf{r}^T \mathbf{x}_{s_{ik}})}{\partial r_{j'}} \Big)\, p^k_{ij} \\
&= \frac{1}{\sigma_\gamma^2} \sum_{i,k} \Big( (-1)\, g'(\mathbf{r}^T \mathbf{x}_{s_{ik}})\, p^k_{ij'}\, g'(\mathbf{r}^T \mathbf{x}_{s_{ik}}) + \big( E^*(\gamma^k_i) - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}) \big)\, g''(\mathbf{r}^T \mathbf{x}_{s_{ik}})\, p^k_{ij'} \Big)\, p^k_{ij} \\
&= \frac{1}{\sigma_\gamma^2} \sum_{i,k} \Big( \big( E^*(\gamma^k_i) - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}) \big)\, g''(\mathbf{r}^T \mathbf{x}_{s_{ik}}) - \big( g'(\mathbf{r}^T \mathbf{x}_{s_{ik}}) \big)^2 \Big)\, p^k_{ij}\, p^k_{ij'}
\end{aligned} \tag{B.15}
$$

We used the Product Rule in this differentiation. The first generalization consists of the second derivative in one dimension of the full vector:

$$\frac{\partial^2 E^*}{\partial \mathbf{r}\, \partial r_{j'}} = \frac{1}{\sigma_\gamma^2} \sum_{i,k} \Big( \big( E^*(\gamma^k_i) - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}) \big)\, g''(\mathbf{r}^T \mathbf{x}_{s_{ik}}) - \big( g'(\mathbf{r}^T \mathbf{x}_{s_{ik}}) \big)^2 \Big)\, p^k_{ij'}\, \mathbf{x}_{s_{ik}} \tag{B.16}$$

Considering $\mathbf{p}^k_i$ to be a column vector, we have the first derivative as a column vector, and the combination of columns composes a matrix. Each column of the matrix is multiplied by the corresponding $p^k_{ij'}$:

$$\frac{\partial^2 E^*}{\partial \mathbf{r}\, \partial \mathbf{r}^T} = \frac{1}{\sigma_\gamma^2} \sum_{i,k} \Big( \big( E^*(\gamma^k_i) - g(\mathbf{r}^T \mathbf{x}_{s_{ik}}) \big)\, g''(\mathbf{r}^T \mathbf{x}_{s_{ik}}) - \big( g'(\mathbf{r}^T \mathbf{x}_{s_{ik}}) \big)^2 \Big)\, \mathbf{x}_{s_{ik}} \mathbf{x}_{s_{ik}}^T \tag{B.17}$$

Since we consider $\mathbf{x}_{s_{ik}}$ to be a column vector, the multiplication of a column vector by a row vector yields the desired value $p^k_{ij}\, p^k_{ij'}$ in position $(j, j')$ of the second-derivative matrix. We now just have to define the derivatives of the sigmoid function $g$, and the first and second-order derivatives are completely defined:

$$g'(x) = \frac{e^x}{(1 + e^x)^2} \tag{B.18}$$

$$g''(x) = \frac{e^x (1 - e^x)}{(1 + e^x)^3} \tag{B.19}$$
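The Newton-Raphson update for $\mathbf{r}$ can be illustrated in one dimension. The sketch below (plain Python, with hypothetical scalar features and $\sigma_\gamma^2 = 1$) iterates $r \leftarrow r - (\partial E^*/\partial r)\,/\,(\partial^2 E^*/\partial r^2)$ to find the root of the first derivative, using the sigmoid derivatives $g'$ and $g''$:

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def gp(x):
    # g'(x) = e^x / (1 + e^x)^2, as in (B.18)
    return math.exp(x) / (1.0 + math.exp(x)) ** 2

def gpp(x):
    # g''(x) = e^x (1 - e^x) / (1 + e^x)^3 (negative for x > 0)
    return math.exp(x) * (1.0 - math.exp(x)) / (1.0 + math.exp(x)) ** 3

# Hypothetical scalar features x, with targets generated at true r = 0.8,
# so the empirical means gamma_bar equal g(0.8 * x) exactly.
data = [(0.5, g(0.8 * 0.5)), (1.0, g(0.8 * 1.0)), (2.0, g(0.8 * 2.0))]

def dE(r):
    # 1-D analogue of the first derivative (sigma_gamma^2 = 1)
    return sum((gbar - g(r * x)) * gp(r * x) * x for x, gbar in data)

def d2E(r):
    # 1-D analogue of the second derivative
    return sum(((gbar - g(r * x)) * gpp(r * x) - gp(r * x) ** 2) * x * x
               for x, gbar in data)

r = 0.0
for _ in range(30):
    r -= dE(r) / d2E(r)  # Newton-Raphson on the root of dE
```

On this toy instance the iteration recovers the generating value $r = 0.8$; in the full model the scalars become the vector $\mathbf{r}$, the first derivative the gradient, and the second derivative the Hessian.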
B.3 Derivatives of Loss Function for BETF

As stochastic gradient descent only regards the derivative of a single example, we derive such a single-instance cost, considering the optimization formula restricted to a single instance $h_{ikl}$ present in the training set:

$$E^{ikl}_{BETF} = \frac{1}{2}\big(y^h_{ikl} - p^h_{ikl}\big)^2 + \frac{1}{2}\big(y^r_{kl} - p^r_{kl}\big)^2 + \frac{\lambda_u}{2}\|\mathbf{u}_i\|^2_{Fro} + \frac{\lambda_v}{2}\|\mathbf{v}_k\|^2_{Fro} + \frac{\lambda_p}{2}\|\mathbf{p}_l\|^2_{Fro} + \frac{\lambda_s}{2}\|s\|^2_{Fro}, \tag{B.20}$$

where $p^h_{ikl} = g(\hat{h}_{ikl})$ and $p^r_{kl} = g(\hat{r}_{kl})$. In order to apply stochastic gradient descent, we have to find the derivative of this expression with respect to $\mathbf{u}_i$, $\mathbf{v}_k$, $\mathbf{p}_l$ and $s$. We compute it first for $\mathbf{u}_i$:

$$
\begin{aligned}
\frac{\partial E^{ikl}_{BETF}}{\partial \mathbf{u}_i} &= \frac{1}{2}\, 2\,\big(y^h_{ikl} - g(\hat{h}_{ikl})\big)\big({-g'(\hat{h}_{ikl})}\big)\, \frac{\partial \hat{h}_{ikl}}{\partial \mathbf{u}_i} + \frac{\lambda_u}{2}\, 2\,\mathbf{u}_i \\
&= -\big(y^h_{ikl} - g(\hat{h}_{ikl})\big)\, g'(\hat{h}_{ikl})\, (s \times_v \mathbf{v}_k \times_p \mathbf{p}_l) + \lambda_u \mathbf{u}_i,
\end{aligned} \tag{B.21}
$$

where we calculate $\frac{\partial \hat{h}_{ikl}}{\partial \mathbf{u}_i}$ as follows:

$$
\begin{aligned}
\frac{\partial \hat{h}_{ikl}}{\partial u_{id}} &= \frac{\partial \big( \sum_{x=1}^{K} u_{ix} \sum_{y=1}^{K} \sum_{z=1}^{K} s_{xyz}\, v_{ky}\, p_{lz} \big)}{\partial u_{id}} = \frac{\partial \big( u_{id} \sum_{y=1}^{K} \sum_{z=1}^{K} s_{dyz}\, v_{ky}\, p_{lz} \big)}{\partial u_{id}} \\
&= \sum_{y=1}^{K} \sum_{z=1}^{K} s_{dyz}\, v_{ky}\, p_{lz} = \sum_{z=1}^{K} p_{lz} \sum_{y=1}^{K} s_{dyz}\, v_{ky} = \sum_{z=1}^{K} p_{lz}\, (s \times_v \mathbf{v}_k)_{dz} = (s \times_v \mathbf{v}_k \times_p \mathbf{p}_l)_d \\
\frac{\partial \hat{h}_{ikl}}{\partial \mathbf{u}_i} &= s \times_v \mathbf{v}_k \times_p \mathbf{p}_l,
\end{aligned} \tag{B.22}
$$

which is a $(K, 1)$ vector. For $\mathbf{v}_k$ and $\mathbf{p}_l$, we also have terms related to the matrix of ratings:

$$\frac{\partial E^{ikl}_{BETF}}{\partial \mathbf{v}_k} = -\big(y^h_{ikl} - g(\hat{h}_{ikl})\big)\, g'(\hat{h}_{ikl})\, (s \times_u \mathbf{u}_i \times_p \mathbf{p}_l) - \big(y^r_{kl} - g(\hat{r}_{kl})\big)\, g'(\hat{r}_{kl})\, \mathbf{p}_l + \lambda_v \mathbf{v}_k \tag{B.23}$$

$$\frac{\partial E^{ikl}_{BETF}}{\partial \mathbf{p}_l} = -\big(y^h_{ikl} - g(\hat{h}_{ikl})\big)\, g'(\hat{h}_{ikl})\, (s \times_u \mathbf{u}_i \times_v \mathbf{v}_k) - \big(y^r_{kl} - g(\hat{r}_{kl})\big)\, g'(\hat{r}_{kl})\, \mathbf{v}_k + \lambda_p \mathbf{p}_l \tag{B.24}$$

In the proposed algorithm, differently from typical stochastic gradient descent, each derivative term is updated separately, avoiding multiple updates regarding the same related entities. For example, the second term of the last derivative is related to the author-product pair, which repeats for all votes given to the review written by the author for the product. The same happens with the regularization terms, which would otherwise be updated once for each helpfulness vote in each iteration. Instead, the proposed algorithm updates terms from different sources separately: first, regarding the vote error; second, regarding the rating error; and, finally, regarding the regularization.
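A single stochastic gradient step over the helpfulness term of the loss can be sketched as follows (plain Python with $K = 2$; the factor values are hypothetical, and the bias, rating-error and separate-update mechanics are omitted for brevity):

```python
import math

K = 2  # number of latent dimensions (toy value)

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def gp(x):
    # g'(x) = g(x) (1 - g(x))
    return g(x) * (1.0 - g(x))

def h_hat(u, v, p, s):
    # Full contraction s x_u u x_v v x_p p of the core tensor s
    return sum(s[x][y][z] * u[x] * v[y] * p[z]
               for x in range(K) for y in range(K) for z in range(K))

def mode_vp(s, v, p):
    # (s x_v v x_p p): d-th entry is sum_{y,z} s[d][y][z] v[y] p[z]
    return [sum(s[d][y][z] * v[y] * p[z] for y in range(K) for z in range(K))
            for d in range(K)]

def sgd_step_u(u, v, p, s, y_h, lr=0.1, lam=0.01):
    # u <- u - lr * dE/du, with
    # dE/du = -(y_h - g(h)) g'(h) (s x_v v x_p p) + lam * u
    h = h_hat(u, v, p, s)
    err = y_h - g(h)
    grad = mode_vp(s, v, p)
    return [u[d] + lr * (err * gp(h) * grad[d] - lam * u[d]) for d in range(K)]

# Hypothetical toy factors and a single positive vote y_h = 1:
s = [[[0.5, 0.1], [0.2, 0.3]], [[0.1, 0.4], [0.3, 0.2]]]
u0, v0, p0 = [0.1, 0.2], [0.3, 0.1], [0.2, 0.4]
u1 = sgd_step_u(u0, v0, p0, s, y_h=1.0)
```

With a small learning rate, the squared error on this vote decreases after the step; the updates for $\mathbf{v}_k$, $\mathbf{p}_l$ and $s$ follow the same pattern with their respective gradients.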
This way, the vote error is prioritized over the rating error and the regularization. Although we present the derivative terms together, they are updated separately. Given the definition $\hat{h}_{ikl} = \mu_h + \bar{h}_{u_i} + \bar{h}_{r_k} + s \times_u \mathbf{u}_i \times_v \mathbf{v}_k \times_p \mathbf{p}_l$, we have the following:

$$\frac{\partial \hat{h}_{ikl}}{\partial s_{xyz}} = u_{ix}\, v_{ky}\, p_{lz} \tag{B.26}$$

So, each entry is multiplied by the corresponding dimensions of $\mathbf{u}$, $\mathbf{v}$ and $\mathbf{p}$. Thus, we can extend it to the whole tensor case:

$$\frac{\partial \hat{h}_{ikl}}{\partial s} = \mathbf{u}_i \otimes \mathbf{v}_k \otimes \mathbf{p}_l, \tag{B.27}$$

where $\otimes$ is the outer product of tensors, resulting in a new tensor whose order is the sum of the orders of both operands [Moghaddam et al., 2012]. Consequently, we have the following derivative of the objective function:

$$\frac{\partial E^{ikl}_{BETF}}{\partial s} = -\big(y^h_{ikl} - g(\hat{h}_{ikl})\big)\, g'(\hat{h}_{ikl})\, (\mathbf{u}_i \otimes \mathbf{v}_k \otimes \mathbf{p}_l) + \lambda_s s \tag{B.28}$$

B.4 Derivatives of Loss Function for CoRaSVR

We first focus on the derivative of the regression portion of $E_{CoRaSVR}$, denoted $\partial_{r_k}$, related to a review $r_k$:

$$
\begin{aligned}
\partial_{r_k} = \frac{\partial \big( \gamma\, y_{r_k} \max\{|p_{r_k} - y_{r_k}| - \varepsilon,\ 0\} \big)}{\partial \mathbf{w}}
&= \begin{cases}
\dfrac{\partial \big( \gamma\, y_{r_k} (|p_{r_k} - y_{r_k}| - \varepsilon) \big)}{\partial \mathbf{w}} & |p_{r_k} - y_{r_k}| > \varepsilon \\[2pt]
\dfrac{\partial (\gamma\, y_{r_k}\, 0)}{\partial \mathbf{w}} & |p_{r_k} - y_{r_k}| \leq \varepsilon
\end{cases} \\
&= \begin{cases}
\dfrac{\partial \big( \gamma\, y_{r_k} ((p_{r_k} - y_{r_k}) - \varepsilon) \big)}{\partial \mathbf{w}} & p_{r_k} - y_{r_k} > \varepsilon \\[2pt]
\dfrac{\partial \big( \gamma\, y_{r_k} ((y_{r_k} - p_{r_k}) - \varepsilon) \big)}{\partial \mathbf{w}} & p_{r_k} - y_{r_k} < -\varepsilon \\[2pt]
\dfrac{\partial (\gamma\, y_{r_k}\, 0)}{\partial \mathbf{w}} & |p_{r_k} - y_{r_k}| \leq \varepsilon
\end{cases} \\
&= \begin{cases}
\gamma\, y_{r_k}\, \mathbf{x}_{r_k} & p_{r_k} - y_{r_k} > \varepsilon \\
-\gamma\, y_{r_k}\, \mathbf{x}_{r_k} & p_{r_k} - y_{r_k} < -\varepsilon \\
0 & |p_{r_k} - y_{r_k}| \leq \varepsilon
\end{cases}
\end{aligned} \tag{B.29}
$$

Note that the derivatives of the constants $y_{r_k}$ and $\varepsilon$ are 0, and the derivative of $p_{r_k}$ is $\mathbf{x}_{r_k}$, as $p_{r_k} = \mathbf{w}^T \mathbf{x}_{r_k}$. We now derive the term related to the cross-entropy loss for the pairwise probability, denoted $\partial_{r_i,r_j}$, related to a pair of reviews $r_i, r_j$:

$$
\begin{aligned}
\partial_{r_i,r_j} &= \frac{\partial \Big( (1-\gamma)\, y_{r_i,r_j} \big( \log(1 + e^{p_{r_i} - p_{r_j}}) - y_{r_i,r_j}(p_{r_i} - p_{r_j}) \big) \Big)}{\partial \mathbf{w}} \\
&= (1-\gamma)\, y_{r_i,r_j} \Big( \frac{1}{1 + e^{p_{r_i} - p_{r_j}}}\, \frac{\partial (1 + e^{p_{r_i} - p_{r_j}})}{\partial \mathbf{w}} - y_{r_i,r_j}(\mathbf{x}_{r_i} - \mathbf{x}_{r_j}) \Big) \\
&= (1-\gamma)\, y_{r_i,r_j} \Big( \frac{e^{p_{r_i} - p_{r_j}}}{1 + e^{p_{r_i} - p_{r_j}}}\, (\mathbf{x}_{r_i} - \mathbf{x}_{r_j}) - y_{r_i,r_j}(\mathbf{x}_{r_i} - \mathbf{x}_{r_j}) \Big) \\
&= (1-\gamma)\, y_{r_i,r_j} \Big( \frac{1}{1 + e^{-(p_{r_i} - p_{r_j})}} - y_{r_i,r_j} \Big) (\mathbf{x}_{r_i} - \mathbf{x}_{r_j}) \\
&= (1-\gamma)\, y_{r_i,r_j} \big( g(p_{r_i} - p_{r_j}) - y_{r_i,r_j} \big) (\mathbf{x}_{r_i} - \mathbf{x}_{r_j}),
\end{aligned} \tag{B.30}
$$

where $\frac{e^x}{1+e^x} = \frac{1}{1+e^{-x}} = g(x)$.

Appendix C

Probabilistic Graphical Models of Review Recommender Systems

The probabilistic graphical models (PGMs) for BETF and CAP are presented in this chapter.
Such models help in understanding the relationships among the latent and observed variables of each method. They also allow one to obtain the likelihood by factorizing over the PGM.

Figure C.1: PGM for BETF. Rectangles represent the cardinality of the given variable. Observed variables are represented by dark circles and latent ones, by light circles. We use $h_{ij}$ to denote $y^h_{ij}$.

Figure C.2: PGM for CAP. The notation is similar to that of Figure C.1.