Keywords: recommender systems, artificial intelligence, evaluation, natural language processing, chatbots
Introduction and Background
Conversational Recommender Systems (CRSs) have gained attention in recent years as they provide users with personalized recommendations in a conversational manner. These systems engage in task-oriented, multi-turn dialogues with users, facilitating a deeper understanding of individual preferences and providing item suggestions and explanations [1]. The concept of CRSs has roots dating back to the late 1970s, as envisioned by Rich [2], who foresaw a computerized librarian interacting with users through natural language to make reading suggestions based on personal preferences.
Over the years, various interaction approaches have emerged, including form-based interfaces and critiquing methods, which allow users to apply pre-defined critiques to recommendations. Form-based approaches offer clear and unambiguous user actions, though they can sometimes feel less natural [3, 4]. In contrast, natural language processing (NLP)-based approaches have made substantial progress, particularly in voice command processing, thanks to advancements in language technology and artificial intelligence (AI). These advances have led to the widespread use of voice-activated devices and the rapid adoption of chatbots across diverse application domains, including customer service. However, assessing the quality of a highly interactive system like a CRS remains a challenging task.
In this blog post, we delve into the challenges and methodologies associated with evaluating CRSs and discuss the need for future research in this domain.
The Diverse Landscape of CRS Evaluation
Our review of the existing literature reveals a diverse landscape of evaluation methodologies and metrics for CRSs. Unlike traditional recommender systems, CRSs require user-centric evaluation frameworks. Some researchers propose using general user-centric evaluation frameworks as a basis, but these frameworks are not widely adopted, and no standardized extensions for the conversational setting have been proposed. This lack of standardization makes it difficult to compare different CRS proposals effectively.
Objective Measures and Their Limitations
Researchers mainly resort to objective measures, such as accuracy, recall, or RMSE, to evaluate CRSs. While these measures provide valuable insights, the diversity among CRSs in terms of application domains, interaction strategies, and background knowledge poses a significant challenge for making meaningful comparisons. For instance, a CRS designed for restaurant recommendations may perform differently from one focused on movie suggestions, making direct accuracy comparisons problematic.
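To make this concrete, here is a minimal sketch of how such accuracy-oriented measures might be computed for a single user in plain Python; the item IDs, ratings, and cut-off value are hypothetical and serve only to illustrate what these metrics capture (and what they leave out, such as the quality of the conversation itself).

```python
import math

def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's recommendation list."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical output of a movie CRS for one user.
recommended_items = ["m101", "m205", "m042", "m310", "m007"]
relevant_items = ["m205", "m007", "m999"]  # items the user actually liked
print(precision_recall_at_k(recommended_items, relevant_items, k=5))  # (0.4, 0.666...)

predicted_ratings = [4.2, 3.1, 5.0]
observed_ratings = [4.0, 3.5, 4.5]
print(round(rmse(predicted_ratings, observed_ratings), 3))  # 0.387
```

Note that nothing in these numbers reflects whether the dialogue that produced the recommendations felt natural or efficient to the user.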
The Issue with BLEU Scores
In NLP-based systems, the BLEU score is commonly used for automatic evaluation. However, studies have shown that BLEU scores do not necessarily correlate well with user perceptions, especially at the response level [5, 6]. The evaluation of language models is considered challenging in general, and this complexity is amplified in the case of task-oriented systems like CRSs.
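As a simple illustration of a response-level BLEU computation, the sketch below applies NLTK's sentence_bleu to a pair of invented utterances; the reference and response strings are hypothetical, and the point is only that a reply a user might find perfectly acceptable can still receive a very low n-gram overlap score.

```python
# Response-level BLEU for a single (hypothetical) CRS reply, assuming NLTK is installed.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "you could try the new italian place downtown".split()
candidate = "how about the italian restaurant near the station".split()

score = sentence_bleu(
    [reference],                                      # list of reference token lists
    candidate,                                        # tokens of the system response
    smoothing_function=SmoothingFunction().method1,   # avoid zero scores on short responses
)
print(f"BLEU: {score:.3f}")  # very low, although a user might find the reply acceptable
```

Averaging such scores over a test set says little about whether the conversation actually helped the user find a suitable item.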
Subjective Evaluations
Given the limitations of objective measures, researchers often turn to subjective evaluations. These may include offline experiments with simulated users or user studies. In offline studies, a (hypothetical) user is simulated, answering questions or providing feedback on explanations. However, this approach assumes that users have fixed preferences, which may not always be the case in real-world interactions.
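The sketch below illustrates one possible offline setup under exactly this fixed-preference assumption; the SimulatedUser class, the slot names, and the toy catalog are all hypothetical and simply make the assumption explicit in code.

```python
from typing import Dict, List

class SimulatedUser:
    """A simulated user with fixed, fully known preferences."""
    def __init__(self, preferences: Dict[str, str]):
        self.preferences = preferences  # e.g. {"cuisine": "thai", "price": "high"}

    def answer(self, slot: str) -> str:
        """Answer a preference-elicitation question posed by the CRS."""
        return self.preferences.get(slot, "no preference")

    def accepts(self, item: Dict[str, str]) -> bool:
        """Accept a recommendation only if it matches every stated preference."""
        return all(item.get(s) == v for s, v in self.preferences.items())

def run_dialogue(user: SimulatedUser, catalog: List[Dict[str, str]], slots: List[str]) -> int:
    """Return the turn at which an acceptable item was recommended (0 if none)."""
    constraints: Dict[str, str] = {}
    for turn, slot in enumerate(slots, start=1):
        constraints[slot] = user.answer(slot)
        candidates = [item for item in catalog
                      if all(item.get(s) == v for s, v in constraints.items()
                             if v != "no preference")]
        if candidates and user.accepts(candidates[0]):
            return turn
    return 0  # no acceptable recommendation within the dialogue budget

catalog = [{"cuisine": "italian", "price": "low"},
           {"cuisine": "thai", "price": "high"}]
user = SimulatedUser({"cuisine": "thai", "price": "high"})
print(run_dialogue(user, catalog, slots=["cuisine", "price"]))  # 1
```

The limitation discussed above is visible in the code: the simulated user's preferences never change during the dialogue, whereas real users often construct or revise their preferences while interacting with the system.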
User studies, on the other hand, attempt to mimic realistic decision scenarios. While they remain somewhat artificial, they come closer to real-world situations than offline experiments. However, relying solely on these user studies has its limitations, especially in the context of complex user interactions that CRS systems aim to support.
Understanding User-System Interaction
To address these challenges, more research is needed to understand how humans make recommendations to each other in conversation and how users interact with intelligent assistants. This involves exploring the expectations users have regarding intelligent assistants and the kind of intelligence attributed to them.
Some progress has been made in understanding conversational patterns, but more work is required to understand how users' perceptions are affected when certain communication patterns, such as explanations or adequate information about system recommendations, are not supported. This aspect is crucial, as many CRSs struggle to provide comprehensive explanations for their recommendations.
Conclusion
Evaluating Conversational Recommender Systems presents a unique set of challenges, given the diverse nature of these systems and the complexity of human-computer interactions. While objective measures like accuracy and BLEU scores offer insights, they are insufficient on their own. Subjective evaluations, including user studies, provide a more realistic perspective but also have limitations.
To advance the field of CRSs, future research should focus on standardizing evaluation frameworks, understanding user-system interactions, and developing more effective evaluation methods that align with the conversational nature of these systems. As the demand for CRSs continues to grow, addressing these challenges will be essential to ensure the development of systems that truly meet users' needs and expectations.
References
- [1] Dietmar Jannach, Ahtsham Manzoor, Wanling Cai, and Li Chen. 2021. A survey on conversational recommender systems. ACM Computing Surveys 54, 5 (2021), 1–36.
- [2] Elaine Rich. 1979. User modeling via stereotypes. Cognitive Science 3, 4 (1979), 329–354.
- [3] Frederich N. Tou, Michael D. Williams, Richard Fikes, D. Austin Henderson Jr., and Thomas W. Malone. 1982. RABBIT: An intelligent database assistant. In AAAI’82. 314–318.
- [4] Robin D. Burke, Kristian J. Hammond, and Benjamin C. Young. 1997. The FindMe approach to assisted browsing. IEEE Expert 12, 4 (1997), 32–40.
- [5] Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP’16. 2122–2132.
- [6] Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In AAAI’18.