Bridging Comments Benchmark Dataset

This repository currently contains 11,973 comments annotated for attributes that correlate with prosocial or constructive outcomes in online conversation. These attributes are: Reasoning, Curiosity, Respect, Compassion, Alienation, and Moral Outrage.

This work is a collaboration among:

Ruta Wheelock and Scott Friedman at SIFT, a research and development consulting company that uses NLP and other technologies to make the information flow between humans and technology better for both sides,
Sonja Schmer-Galunder, Glenn and Deborah Renwick Leadership Professor in AI and Ethics at the University of Florida, and
Zaria Jalan, Alyssa Chvasta, and Emily Saltz as part of the Conversation AI project, a collaborative research effort at Jigsaw exploring ML as a tool for better discussions online.

Background

Current annotation practices pose many issues, ranging from Western-centric bias and poor working conditions to risks from exploitative power imbalances and a lack of diverse representation among annotators. Within the machine learning community, an emphasis on data-hungry models and optimization for interrater reliability has led to a focus on data quantity over data quality. However, data labeling, especially for more complex linguistic constructs such as moral justifications of harms, intentionality, or constructive conversations, is a highly qualitative task of induction and meaning that often requires social and cultural knowledge of the context in which the text is embedded. Definitions for theory-driven constructs, however well informed academically, often clash with an annotator's intuitive understanding.

In a forthcoming paper, we will describe the results of our annotation work, which addresses some of the problems mentioned above. The paper will describe qualitative and quantitative methods for increasing interrater reliability while improving conceptual understanding and taking into consideration the situatedness of annotation workers and their working conditions. We publish here the resulting benchmark dataset for assessing constructive conversations.

Methods

Curation and labeling

The dataset is composed of 11,973 comments from Civil Comments, a publicly available dataset of comments from independent and international news sites created between 2015 and 2017. The data was labeled for six attributes: reasoning, curiosity, respect, compassion, alienation, and moral outrage. Because these attributes have low prevalence, candidate comments were first scored by a proprietary model and then filtered by score to ensure a higher proportion of in-class comments. The data was sent to a pool of 7 annotators in three batches, which allowed the sampling to be improved iteratively. Later batches constrained the minimum and maximum text length and limited the amount of text dealing with Canadian politics by dropping comments containing the terms “Trudeau” and “Canada”. Each comment received 4 annotations from the annotator pool. Additionally, 698 of the 11,973 examples have identity terms labeled within the Civil Comments dataset.
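The sampling filters described above could be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the scoring model is proprietary, so the `score` field, the threshold values, the length bounds, and the case-insensitive substring matching are all assumptions. The source states only that comments were filtered by score, that later batches constrained text length, and that later batches dropped comments containing “Trudeau” and “Canada”.

```python
from dataclasses import dataclass

# Terms named in the curation description. Matching them as
# case-insensitive substrings is an assumption.
EXCLUDED_TERMS = ("trudeau", "canada")


@dataclass
class ScoredComment:
    text: str
    score: float  # hypothetical model score in [0, 1]


def keep_comment(comment, min_score=0.5, min_len=20, max_len=1000):
    """Return True if the comment passes all three filters.

    The default thresholds are placeholders, not values from the dataset.
    """
    if comment.score < min_score:
        return False
    if not (min_len <= len(comment.text) <= max_len):
        return False
    lowered = comment.text.lower()
    return not any(term in lowered for term in EXCLUDED_TERMS)


candidates = [
    ScoredComment("Could you explain how the new zoning rule affects renters?", 0.82),
    ScoredComment("Trudeau should explain this policy in more detail, please.", 0.90),
    ScoredComment("ok", 0.95),
    ScoredComment("This is a long enough comment but the model scored it low.", 0.10),
]
selected = [c for c in candidates if keep_comment(c)]
```

Only the first candidate survives here: the second contains an excluded term, the third is too short, and the fourth falls below the score threshold.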

Definitions

Label	Definition
Reasoning	Makes specific or well-reasoned points to provide a fuller understanding of the topic without disrespect or provocation.
Curiosity	Attempts to clarify or ask follow-up questions to better understand another person or topic.
Respect	Shows deference or appreciation to others, or acknowledges the validity of another person.
Compassion	Expressions of care and concern for others, understanding the feelings or viewpoint of others, including support or condolences.
Alienation	Portrays someone as inferior, implies a lack of belonging, or frames the statement in an us vs. them context.
Moral outrage	Anger, disgust, or frustration directed toward other people or entities who seem to violate the author’s ethical values or standards.
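
One way a consumer of the dataset might combine the 4 annotations per comment into per-label agreement is sketched below. The input format (one set of selected labels per annotator), the lowercase label identifiers, and the 0.5 threshold are assumptions made for illustration; they do not describe the layout of the published CSV or an aggregation rule recommended by the authors.

```python
# Label identifiers derived from the definitions table. The exact
# spelling used in the data file is not stated here, so these are assumed.
ATTRIBUTES = (
    "reasoning",
    "curiosity",
    "respect",
    "compassion",
    "alienation",
    "moral_outrage",
)


def attribute_fractions(annotations):
    """Return the share of annotators who selected each attribute.

    annotations: a list with one set of attribute names per annotator.
    """
    n = len(annotations)
    if n == 0:
        raise ValueError("at least one annotation is required")
    return {a: sum(a in ann for ann in annotations) / n for a in ATTRIBUTES}


def labels_at_threshold(fractions, threshold=0.5):
    """Return the attributes whose agreement meets the (assumed) threshold."""
    return sorted(a for a, f in fractions.items() if f >= threshold)


# Four hypothetical annotations for a single comment.
example = [
    {"curiosity", "respect"},
    {"curiosity"},
    {"curiosity", "reasoning"},
    set(),
]
fractions = attribute_fractions(example)
labels = labels_at_threshold(fractions)
```

Keeping the fractions rather than a single majority vote avoids forcing a decision when the 4 annotators split two against two.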

Copyright and license

All data in this repository is made available under the Creative Commons CC0 1.0 Universal license (CC0 1.0). A full copy of the license can be found at https://creativecommons.org/publicdomain/zero/1.0/

