We collect and classify evaluation methods for a range of dialog tasks, covering work published from 2012 onward.
Tasks include:
- Open-domain Dialog
- Task-oriented Dialog
- Dialog Summarization
- Dialog Management
- Dialog State Tracking
- Dialog Policy
- Knowledge-grounded Dialog
- Conversational Search
- Conversational Recommendation
- Others
Modalities include:
- Text-based Dialog
- Speech-based Dialog
- Visual-based Dialog
- Multimodal Dialog
- Survey on evaluation methods for dialogue systems. Artificial Intelligence Review 2021
- Conversational Recommendation: Formulation, Methods, and Evaluation. SIGIR 2020
- A review of evaluation techniques for social dialogue systems. ISIAA@ICMI 2017
- A Comprehensive Assessment of Dialog Evaluation Metrics. CoRR 2021
- How to Evaluate Your Dialogue Models: A Review of Approaches. CoRR 2021
- MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation. AAAI
- Towards Fair Evaluation of Dialogue State Tracking by Flexible Incorporation of Turn-level Performances. ACL
- What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation. ACL
- DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations. ACL
- Probing the Robustness of Trained Metrics for Conversational Dialogue Systems. ACL
- Mismatch between Multi-turn Dialogue and its Evaluation Metric in Dialogue State Tracking. ACL
- Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents. ConvAI@ACL
- Relevance in Dialogue: Is Less More? An Empirical Comparison of Existing Metrics, and a Novel Simple Metric. ConvAI@ACL 2022
- Doctor XAvIer: Explainable Diagnosis on Physician-Patient Dialogues and XAI Evaluation. BioNLP@ACL
- Open-Domain Dialog Evaluation Using Follow-Ups Likelihood. COLING
- Does GPT-3 Generate Empathetic Dialogues? A Novel In-Context Example Selection Method and Automatic Evaluation Metric for Empathetic Dialogue Generation. COLING
- SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation. COLING
- Integrating Pretrained Language Model for Dialogue Policy Evaluation. ICASSP
- A Dependency-Aware Utterances Permutation Strategy to Improve Conversational Evaluation. ECIR
- DialSummEval: Revisiting Summarization Evaluation for Dialogues. NAACL
- Explaining Dialogue Evaluation Metrics using Adversarial Behavioral Analysis. NAACL
- Long-term Control for Dialogue Generation: Methods and Evaluation. NAACL
- Generate, Evaluate, and Select: A Dialogue System with a Response Evaluator for Diversity-Aware Response Generation. NAACL-HLT (Student Research Workshop)
- MultiWOZ 2.4: A Multi-Domain Task-Oriented Dialogue Dataset with Essential Annotation Corrections to Improve State Tracking Evaluation. SIGDIAL
- A Systematic Evaluation of Response Selection for Open Domain Dialogue. SIGDIAL
- Dialogue Evaluation with Offline Reinforcement Learning. SIGDIAL
- Evaluating N-best Calibration of Natural Language Understanding for Dialogue Systems. SIGDIAL
- Evaluation of Off-the-shelf Speech Recognizers on Different Accents in a Dialogue Domain. LREC
- Evaluating the Effects of Embedding with Speaker Identity Information in Dialogue Summarization. LREC
- Design and Evaluation of the Corpus of Everyday Japanese Conversation. LREC
- Evaluating Gender Bias in Film Dialogue. NLDB
- Statistical and clinical utility of multimodal dialogue-based speech and facial metrics for Parkinson's disease assessment. INTERSPEECH
- Which Model is Best: Comparing Methods and Metrics for Automatic Laughter Detection in a Naturalistic Conversational Dataset. INTERSPEECH
- Evaluation of call centre conversations based on a high-level symbolic representation. INTERSPEECH
- Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark. TACL
- A Review of Evaluation Practices of Gesture Generation in Embodied Conversational Agents. IEEE Trans. Hum. Mach. Syst.
- Does Social Presence Increase Perceived Competence?: Evaluating Conversational Agents in Advice Giving Through a Video-Based Survey. Proc. ACM Hum. Comput. Interact
- "I don't know what you mean by 'I am anxious'": A New Method for Evaluating Conversational Agent Responses to Standardized Mental Health Inputs for Anxiety and Depression. TIIS
- Ditch the Gold Standard: Re-evaluating Conversational Question Answering. ACL
- Evaluating the Cranfield Paradigm for Conversational Search Systems. ICTIR
- Evaluating Mixed-initiative Conversational Search Systems via User Simulation. WSDM
- FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows. CoRR
- Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges. CoRR
- MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue. CoRR
- Interactive Evaluation of Dialog Track at DSTC9. CoRR
- EnDex: Evaluation of Dialogue Engagingness at Scale. CoRR
- FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation. CoRR
- End-to-End Evaluation of a Spoken Dialogue System for Learning Basic Mathematics. CoRR
- Bipartite-play Dialogue Collection for Practical Automatic Evaluation of Dialogue Systems. CoRR
- CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation. CoRR
- Analyzing and Evaluating Faithfulness in Dialogue Summarization. CoRR
- ED-FAITH: Evaluating Dialogue Summarization on Faithfulness. CoRR
- INFACT: An Online Human Evaluation Framework for Conversational Recommendation. CoRR
- Evaluation of Automated Speech Recognition Systems for Conversational Speech: A Linguistic Perspective. CoRR
- Evaluating Data-Driven Co-Speech Gestures of Embodied Conversational Agents through Real-Time Interaction. CoRR
- Evaluating Conversational Recommender Systems. CoRR
- Conversation Graph: Data Augmentation, Training and Evaluation for Non-Deterministic Dialogue Management. TACL
- Meta-evaluation of Conversational Search Evaluation Metrics. TIS
- D-Score: Holistic Dialogue Evaluation Without Reference. TASLP
- How Am I Doing?: Evaluating Conversational Search Systems Offline. TIS
- Preserving Conversations with Contemporary Holocaust Witnesses: Evaluation of Interactions with a Digital 3D Testimony. CHI Extended Abstracts
- Heuristic Evaluation of Conversational Agents. CHI
- "How Robust R U?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations. ASRU
- POSSCORE: A Simple Yet Effective Evaluation of Conversational Search with Part of Speech Labelling. CIKM
- Evaluating Human-AI Hybrid Conversational Systems with Chatbot Message Suggestions. CIKM
- Enhancing the Open-Domain Dialogue Evaluation in Latent Space. ACL Findings
- RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models. ACL
- Towards a more Robust Evaluation for Conversational Question Answering. ACL
- Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation. ACL Findings
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation. ACL Findings
- What Did You Refer to? Evaluating Co-References in Dialogue. ACL Findings
- RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems. ACL
- A Human-machine Collaborative Framework for Evaluating Malevolence in Dialogues. ACL
- LEGOEval: An Open-Source Toolkit for Dialogue System Evaluation via Crowdsourcing. ACL demo
- Towards Quantifiable Dialogue Coherence Evaluation. ACL
- DynaEval: Unifying Turn and Dialogue Level Evaluation. ACL
- Hierarchical Dependence-aware Evaluation Measures for Conversational Search. SIGIR
- The Interplay of Task Success and Dialogue Quality: An in-depth Evaluation in Task-Oriented Visual Dialogues. EACL
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach. EMNLP
- $Q^2$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering. EMNLP
- NDH-Full: Learning and Evaluating Navigational Agents on Full-Length Dialogue. EMNLP
- Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions. EMNLP
- Large-Scale Quantitative Evaluation of Dialogue Agents' Response Strategies against Offensive Users. SIGDIAL
- How "open" are the conversations with open-domain chatbots? A proposal for Speech Event based evaluation. SIGDIAL
- Contrastive Response Pairs for Automatic Evaluation of Non-task-oriented Neural Conversational Models. SIGDIAL
- Simulating User Satisfaction for the Evaluation of Task-oriented Dialogue Systems. SIGIR
- Non-goal oriented dialogue agents: state of the art, dataset, and evaluation. Artif. Intell. Rev
- An Evaluation of Chinese Human-Computer Dialogue Technology. Data Intell.
- CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers. ICLR
- WeChat AI's Submission for DSTC9 Interactive Dialogue Evaluation Track. CoRR
- On the Use of Linguistic Features for the Evaluation of Generative Dialogue Systems. CoRR
- Towards Quantifiable Dialogue Coherence Evaluation. CoRR
- Improving Computer Generated Dialog with Auxiliary Loss Functions and Custom Evaluation Metrics. CoRR
- Naturalness Evaluation of Natural Language Generation in Task-oriented Dialogues using BERT. CoRR
- Investigating the Impact of Pre-trained Language Models on Dialog Evaluation. CoRR
- Automatic Evaluation and Moderation of Open-domain Dialogue Systems. CoRR
- User Response and Sentiment Prediction for Automatic Dialogue Evaluation. CoRR
- Evaluate On-the-job Learning Dialogue Systems and a Case Study for Natural Language Understanding. CoRR
- Evaluating Predictive Uncertainty under Distributional Shift on Dialogue Dataset. CoRR
- Evaluating Pretrained Transformer Models for Entity Linking in Task-Oriented Dialog. CoRR
- A Conceptual Framework for Implicit Evaluation of Conversational Search Interfaces. CoRR
- An Automated Quality Evaluation Framework of Psychotherapy Conversations with Local Quality Estimates. CoRR
- Is my agent good enough? Evaluating Embodied Conversational Agents with Long and Short-term interactions. CoRR
- Evaluating Trust in the Context of Conversational Information Systems for new users of the Internet. CoRR
- Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining. TACL
- PONE: A Novel Automatic Evaluation Metric for Open-domain Generative Dialogue Systems. TIS
- How to Evaluate Single-Round Dialogues Like Humans: An Information-Oriented Metric. TASLP
- Predictive Engagement: An Efficient Metric for Automatic Evaluation of Open-Domain Dialogue Systems. AAAI
- Studying the Effects of Cognitive Biases in Evaluation of Conversational Agents. CHI
- A Conversational Agent to Improve Response Quality in Course Evaluations. CHI Extended Abstracts
- Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation. ACL
- USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. ACL
- Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation. ACL
- Can You Put it All Together: Evaluating Conversational Agents' Ability to Blend Skills. ACL
- Learning an Unreferenced Metric for Online Dialogue Evaluation. ACL
- Evaluating Dialogue Generation Systems via Response Selection. ACL
- Designing Precise and Robust Dialogue Response Evaluators. ACL
- uBLEU: Uncertainty-Aware Automatic Evaluation Method for Open-Domain Dialogue Systems. ACL (Student Research Workshop)
- ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems. ACL demo
- VoiceAI Systems to NIST SRE19 Evaluation: Robust Speaker Recognition on Conversational Telephone Speech. ICASSP
- Semantic Diversity for Natural Language Understanding Evaluation in Dialog Systems. COLING (Industry)
- Evaluating Cross-Lingual Transfer Learning Approaches in Multilingual Conversational Agent Models. COLING (Industry)
- Language Model Transformers as Evaluators for Open-domain Dialogues. COLING
- Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems. COLING
- A Comprehensive Evaluation of Incremental Speech Recognition and Diarization for Conversational AI. COLING
- Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems. EMNLP
- GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems. EMNLP
- Interactive Evaluation of Conversational Agents: Reflections on the Impact of Search Task Design. ICTIR
- Treating Dialogue Quality Evaluation as an Anomaly Detection Problem. LREC
- Evaluation of Off-the-shelf Speech Recognizers Across Diverse Dialogue Domains. LREC
- Evaluation of Argument Search Approaches in the Context of Argumentative Dialogue Systems. LREC
- Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols. SIGDIAL
- Unsupervised Evaluation of Interactive Dialog with DialoGPT. SIGDIAL
- Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation. SIGDIAL
- FinChat: Corpus and Evaluation Setup for Finnish Chat Conversations on Everyday Topics. INTERSPEECH
- Challenges in the Evaluation of Conversational Search Systems. Converse@KDD
- Evaluating Conversational Recommender Systems via User Simulation. KDD
- A Revised Generative Evaluation of Visual Dialogue. CoRR
- How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics. CoRR
- Turn-level Dialog Evaluation with Dialog-level Weak Signals for Bot-Human Hybrid Customer Service Systems. CoRR
- Submitting surveys via a conversational interface: an evaluation of user acceptance and approach effectiveness. CoRR
- An Evaluation Protocol for Generative Conversational Systems. CoRR
- SSA: A More Humanized Automatic Evaluation Method for Open Dialogue Generation. IJCNN
- Re-Evaluating ADEM: A Deeper Look at Scoring Dialogue Responses. AAAI
- Probabilistic-Logic Bots for Efficient Evaluation of Business Rules Using Conversational Interfaces. AAAI
- Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement. INLG
- Importance of Search and Evaluation Strategies in Neural Dialogue Modeling. INLG
- Towards Best Experiment Design for Evaluating Dialogue System Output. INLG
- Towards Coherent and Engaging Spoken Dialog Response Generation Using Automatic Conversation Evaluators. INLG
- Are the Tools up to the Task? an Evaluation of Commercial Dialog Tools in Developing Conversational Enterprise-grade Dialog Systems. NAACL-HLT
- Evaluating and Enhancing the Robustness of Dialogue Systems: A Case Study on a Negotiation Agent. NAACL-HLT
- Evaluating Coherence in Dialogue Systems using Entailment. NAACL-HLT
- Evaluating and Enhancing the Robustness of Retrieval-Based Dialogue Systems with Adversarial Examples. NLPCC
- Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems. NeurIPS
- Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References. SIGDIAL
- A Crowd-based Evaluation of Abuse Response Strategies in Conversational Agents. SIGDIAL
- User Evaluation of a Multi-dimensional Statistical Dialogue System. SIGDIAL
- Automatic evaluation of end-to-end dialog systems with adequacy-fluency metrics. Comput. Speech Lang.
- MusicBot: Evaluating Critiquing-Based Music Recommenders with Conversational Interaction. CIKM
- Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings. CoRR
- Domain-Independent turn-level Dialogue Quality Evaluation via User Satisfaction Estimation. CoRR
- ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons. CoRR
- How to Evaluate the Next System: Automatic Dialogue Evaluation from the Perspective of Continual Learning. CoRR
- Evaluating Older Users' Experiences with Commercial Dialogue Systems: Implications for Future Design and Development. CoRR
- Short Text Conversation Based on Deep Neural Network and Analysis on Evaluation Measure. CoRR
- SIMMC: Situated Interactive Multi-Modal Conversational Data Collection And Evaluation Platform. CoRR
- Multi-domain Conversation Quality Evaluation via User Satisfaction Estimation. CoRR
- RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems. AAAI
- Evaluation of Real-time Deep Learning Turn-taking Models for Multiple Dialogue Scenarios. ICMI
- One "Ruler" for All Languages: Multi-Lingual Dialogue Evaluation with Adversarial Multi-Task Learning. IJCAI
- Evaluating and Complementing Vision-to-Language Technology for People who are Blind with Conversational Crowdsourcing. IJCAI
- Adaboost with Auto-Evaluation for Conversational Models. IJCAI
- Towards a Structured Evaluation of Improv-bots: Improvisational Theatre as a Non-goal-driven Dialogue System. LaCATODA@IJCAI
- Expert Evaluation of a Spoken Dialogue System in a Clinical Operating Room. LREC
- EuroGames16: Evaluating Change Detection in Online Conversation. LREC
- LSDSCC: a Large Scale Domain-Specific Conversational Corpus for Response Generation with Diversity Oriented Evaluation Metrics. NAACL-HLT
- Empirical Evaluation of Character-Based Model on Neural Named-Entity Recognition in Indonesian Conversational Texts. NUT@EMNLP
- A Methodology for Evaluating Interaction Strategies of Task-Oriented Conversational Agents. SCAI@EMNLP
- Topic-based Evaluation for Conversational Bots. CoRR
- On Evaluating and Comparing Conversational Agents. CoRR
- Adversarial evaluation for open-domain dialogue generation. SIGDIAL Conference
- Evaluating Natural Language Understanding Services for Conversational Question Answering Systems. SIGDIAL Conference
- Generating and Evaluating Summaries for Partial Email Threads: Conversational Bayesian Surprise and Silver Standards. SIGDIAL Conference
- Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses. ACL
- Evaluating Persuasion Strategies and Deep Reinforcement Learning methods for Negotiation Dialogue agents. EACL
- Sherlock: Experimental Evaluation of a Conversational Agent for Mobile Information Tasks. IEEE Trans. Hum. Mach. Syst
- Adversarial Evaluation of Dialogue Models. CoRR
- The First Evaluation of Chinese Human-Computer Dialogue Technology. CoRR
- Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation. CoRR
- Evaluating Quality of Chatbots and Intelligent Conversational Agents. CoRR
- Perspectives for Evaluating Conversational AI. CoRR
- Evaluating Visual Conversational Agents via Cooperative Human-AI Games. CoRR
- How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. EMNLP
- Evaluation Dataset (DT-Grade) and Word Weighting Approach towards Constructed Short Answers Assessment in Tutorial Dialogue Context. BEA@NAACL-HLT
- On the Evaluation of Dialogue Systems with Next Utterance Classification. SIGDIAL Conference
- The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics. LREC
- Automatic creation of scenarios for evaluating spoken dialogue systems via user-simulation. Knowl. Based Syst
- Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems. ICLR (Poster)
- Interactive Topic Modeling for Exploring Asynchronous Online Conversations: Design and Evaluation of ConVisIT. TIS
- Evaluation of Crowdsourced User Input Data for Spoken Dialog Systems. SIGDIAL Conference
- Evaluating Spoken Dialogue Processing for Time-Offset Interaction. SIGDIAL Conference
- Query Refinement Using Conversational Context: A Method and an Evaluation Resource. NLDB
- Extrinsic Evaluation of Dialog State Tracking and Predictive Metrics for Dialog Policy Optimization. SIGDIAL Conference
- Evaluating a Spoken Dialogue System that Detects and Adapts to User Affective States. SIGDIAL Conference
- Evaluating coherence in open domain conversational systems. INTERSPEECH
- Modeling and evaluating dialog success in the LAST MINUTE corpus. LREC
- Japanese conversation corpus for training and evaluation of backchannel prediction model. LREC
- Network assisted rate adaptation for conversational video over LTE, concept and performance evaluation. CSWS@SIGCOMM
- Evaluation of a Conversation Management Toolkit for Multi Agent Programming. CoRR
- Development and evaluation of spoken dialog systems with one or two agents. INTERSPEECH
- Affective evaluation of multimodal dialogue games for preschoolers using physiological signals. INTERSPEECH
- Evaluating spoken dialogue models under the interactive pattern recognition framework. INTERSPEECH
- Evaluating an adaptive dialog system for the public. INTERSPEECH
- How Was Your Day? Evaluating a Conversational Companion. TAC
- In-Context Evaluation of Unsupervised Dialogue Act Models for Tutorial Dialogue. SIGDIAL Conference
- Evaluation of Speech Dialog Strategies for Internet Applications in the Car. SIGDIAL Conference
- Evaluating State Representations for Reinforcement Learning of Turn-Taking Policies in Tutorial Dialogue. SIGDIAL Conference
- Evaluating a City Exploration Dialogue System with Integrated Question-Answering and Pedestrian Navigation. ACL
- Implementation and evaluation of a multimodal addressee identification mechanism for multiparty conversation systems. ICMI
- Iterative Development and Evaluation of a Social Conversational Agent. IJCNLP
- An Automatic Dialog Simulation Technique to Develop and Evaluate Interactive Conversational Agents. Appl. Artif. Intell
- Practical Evaluation of Human and Synthesized Speech for Virtual Human Dialogue Systems. LREC
- Evaluation of Online Dialogue Policy Learning Techniques. LREC
- Resource Evaluation for Usable Speech Interfaces: Utilizing Human-Human Dialogue. LREC
- Evaluation of the KomParse Conversational Non-Player Characters in a Commercial Virtual World. LREC
- Evaluating expressive speech synthesis from audiobook corpora for conversational phrases. LREC
- Developing and evaluating an emergency scenario dialogue corpus. LREC
- Intrinsic and Extrinsic Evaluation of an Automatic User Disengagement Detector for an Uncertainty-Adaptive Spoken Dialogue System. HLT-NAACL
- Position Paper: Towards Standardized Metrics and Tools for Spoken and Multimodal Dialog System Evaluation. SDCTD@NAACL-HLT
- An End-to-End Evaluation of Two Situated Dialog Systems. SIGDIAL Conference
- Evaluating language understanding accuracy with respect to objective outcomes in a dialogue system. EACL
- Topic identification based extrinsic evaluation of summarization techniques applied to conversational speech. ICASSP
- Conversational evaluation of artificial bandwidth extension of telephone speech using a mobile handset. ICASSP
- Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis. Speech Commun.
- Conversational Evaluation of Speech Bandwidth Extension Using a Mobile Handset. IEEE Signal Process. Lett.
- Designing generalisation evaluation function through human-machine dialogue. CoRR
If you have any questions about this repository or would like to suggest additional work on dialog evaluation, feel free to open an issue or email Peiyuan Gong ([email protected]).