The gold standard is composed by a set of 5 short videos from the BBC One Minute World News Website which have been annotated with a list of Named Entities that illustrate the different facts depicted in them. Each named entity has an associated score normalized from -1 to 1 in order to indicate its importance of that entity inside the whole context of the video.
- Entities and scores per video are available in folder data
- Metodology is explained in Section Metodology below
- Videos and transcripts are included in corresponding folders.
- You can read our paper in how to automatically recreate those entities by relying on a Web document expansion approach.
- Gold Standard data is also shown below.
We selected 5 short videos from the BBC One Minute World News Website . Each video lasted from 1 to 3 minutes. The selection covered a wide range of subjects specifically: politics, armed conflicts, environmental events, legal disputes, and social news. The intention behind this topic choice was to fit international audiences, since we planned to perform a user study with international participants. Subtitles of the videos were not available; therefore, a member of the team manually transcribed the speech in the videos. After obtaining the transcriptions, the following steps were performed in order to obtain a set of unbiased candidate entities. We chose to focus only on entities of the types person, organisation and location because they can directly translate or answer three questions: Who is involved? What happened? Where did it take place? These questions are a subset taken from a well known concept in journalism known as the 5Ws (who, what, when, where and why), which emphasizes the fundamental dimensions that an informative journalistic text should report on. Regarding the questions when and why, there where discarded because in order to acquire meaningfulness they need to be modeled not only by single entities but also by more complex relations between them that are out of the scope of the current paper.
All entities of the type person, organisation and location were manually extracted from each one of the video subtitles and added to the unfiltered list of entities (candidate set).
The video image was visually analysed by a researcher and every time a recognisable person, organisation or location was portrayed this was also added as an entity to the candidate set.
The video was analysed for text appearing in the image. Whenever text appeared in the video image, for example in the form of nametag overlays, the named entities appearing in such tags were added to the candidate set.
In order to complement the candidate set with entities that might be interesting for the user, but are not necessarily found in the videos, we used the following two strategies:
We collaborated with a journalist with more than 6 years of experience as a writer/editor for important American newspapers and websites. We configured an online survey to retrieve the expert’s feedback. In the survey we explained what named entities are and which types of named entities we needed. After this introduction we presented the videos to the expert. Following each one we asked him to list the named entities that, according to his criteria, would better serve the objective of showing interesting additional information to the users. The expert didn’t have access to the candidate set, and was completely free to suggest any named entity he wanted.
We used Google custom search to look for articles related to the video in three news sources: The Guardian, New York Times, and Al Jazeera online (English). We performed this search using the main terms in the videos’ title, for example, for Fugitive Edward Snowden applies for asylum in Russia we searched for Edward + Snowden + asylum + Russia. We limited the results to + - 3 days from the day when the video was published. We chose one document from each source, the one closest in topic and time to the video. We then extracted all named entities of the types person, organisation and location from the resulting documents. In order to keep the entities within a reasonable number for inclusion in a survey, we kept only the named entities that appeared in at least 2 related articles and dropped all the ones that only appeared in one. The selected entities were added to the candidate set.
We refined the candidate set comprised of all found entities by eliminating all duplicated named entities and standardising names. For example, when we had Barack Obama'' as an entity and
Obama'' as another entity we eliminated the shorter one and left the complete name.
A total of 99 entities were obtained from all videos. For a distribution of entities among types and videos please see next Table :
Newscast Title | Date | Person | Organization | Location | Total |
---|---|---|---|---|---|
Fugitive Edward Snowden applies for asylum in Russia | 2013-07-03 | 11 | 7 | 10 | 28 |
Egypt’s Morsi Vows to Stay in Power | 2013-07-23 | 4 | 5 | 4 | 17 |
Fukushima leak causes Japan concern | 2013-07-24 | 7 | 5 | 5 | 13 |
Rallies in US after Zimmerman Verdict | 2013-07-17 | 9 | 2 | 8 | 19 |
Royal Baby Prince Named George | 2013-07-15 | 15 | 2 | 6 | 22 |
Total | 46 | 20 | 33 | 99 |
We created an online survey with the aim of gathering information about the degree of interestingness of the entities in the candidate set. Based in \cite{vonBrzeski:2007:LCU:1321440.1321537} we define interestingness as to whether an entity is interesting, useful or compelling enough to tear the user away from the main thread of the document.
Fifty international subjects participated in this online study. They responded an online call distributed via email and social networks. Their age range was between 25 and 54 years with an average age of 30.3 (standard deviation 7.3 years). 18 participants were female and 32 were male. Most of the participants were highly educated and 48 of them had either a university bachelor degree or a postgraduate degree. The main requisite for participation was that they were interested in the news and followed the news regularly, preferably through means that include newscasts. During the interview participants were asked to choose at least 3 out of 5 videos according to their preferences. Then they were shown each one of the videos. After each video, they were asked to rate whether they would be interested in receiving more information about the named entities in the context of the news video and on a second screen or similar application. All the named entities from the candidate set related to the last seen video were shown in a list with ratio buttons arranged in a similar way to a three-point Likert-scale. The possible answers were Yes Maybe and No.
The number of respondents per video were: Snowden 49, Morsi 34, Fukushima 42, Zimmerman 27, and Royal baby 15.
In order to calculate the interestingness scores from the users’ responses, we gave every answer a numerical value: Yes = 1, Maybe = 0 and No = -1. We then obtained an average score for each entity using the number of participants that rated an entity and the score that each participant gave to that particular entity. This average was used to obtain a ranking of all the entities in the candidate set according to user preferences.
With the objective of finding out users preferences regarding the three analysed entity types: person, organization and location, we calculated an average rating for each one of the entity types. The results in descending order were: organisation = -0,05, person = -0,24 and location = -0,52. These results clearly show a preference from users for the entities of the type organisation and person over those of the type location. \end{document}
Here the different entities that compose the
Entity Label | Score |
---|---|
Edward Snowden | 0.5555556 |
National Security Agency | 0.3777778 |
Central Intelligence Agency | 0.0666667 |
Glenn Greenwald | 0.0666667 |
Foreign Intelligence Surveillance Court | 0 |
Booz Allen Hamilton | -0.1333333 |
Laura Poitras | -0.2666667 |
Barton Gellman | -0.3111111 |
Verizon Wireless | -0.3333333 |
Angela Merkel | -0.3777778 |
Anatoly Kucherena | -0.4 |
Federal Migration Service | -0.4444444 |
US State Department | -0.4444444 |
Vladimir Putin | -0.4666667 |
Sheremetyevo International Airport | -0.4666667 |
Nicaragua | -0.4888889 |
Evo Morales | -0.4888889 |
Patrick Ventrell | -0.5111111 |
Moscow | -0.5555556 |
Bolivia | -0.5777778 |
Kremlin | -0.6 |
Russia | -0.6222222 |
Venezuela | -0.6222222 |
United States of America | -0.6666667 |
Latin America | -0.7333333 |
Barack Obama | -0.7555556 |
Jay Carney | -0.7555556 |
South America | -0.8 |
URL: http://www.bbc.com/news/world-middle-east-23156103
Entity Label | Score |
---|---|
Muslim Brotherhood | 0.441176471 |
Mohamed Morsi | 0.352941176 |
Islamist Nour Party | 0.323529412 |
Arab Spring | 0.264705882 |
Egypt opposition alliance | 0.235294118 |
Abdul Fattah Sisi | 0.088235294 |
Hosni Mubarak | -0.029411765 |
Adly Mansour | -0.176470588 |
Tahrir Square | -0.235294118 |
Mohamed ElBaradei | -0.235294118 |
Supreme Constitutional Court | -0.294117647 |
Egypt | -0.352941176 |
Cairo | -0.470588235 |
Cairo University | -0.470588235 |
Nasr | -0.529411765 |
Jeremy Bowen | -0.705882353 |
Nisha Pilla | -0.882352941 |
URL: http://www.bbc.com/news/world-asia-pacific-23425570
Entity Label | Score |
---|---|
Fukushima Daiichi | 0.5 |
Tokyo Electric Power Company | 0.476190476 |
Reactor building no 3 | 0.404761905 |
Nuclear Regulation Authority | 0.142857143 |
Pacific Ocean | -0.357142857 |
Nuclear Regulatory Commission | -0.357142857 |
Japan | -0.380952381 |
Naomi Hirose | -0.476190476 |
Tokyo | -0.5 |
Shunichi Tanaka | -0.523809524 |
Shinzo Abe | -0.547619048 |
Rupert Wingfield Hayes | -0.785714286 |
TV Tokyo | -0.880952381 |
URL: http://www.bbc.com/news/world-us-canada-23312559
Entity Label | Score |
---|---|
George Zimmerman | 0.444444444 |
Trayvon Martin | 0.407407407 |
National Association for the Advancement of Colored People (NAACP) | -0.037037037 |
Rachel Jeantel | -0.185185185 |
General Eric Holder | -0.222222222 |
Rev. Lowman Oliver | -0.259259259 |
Mark O'Mara | -0.259259259 |
Sanford | -0.333333333 |
Angela Corey | -0.37037037 |
United States Department of Justice | -0.407407407 |
Matt Gutman | -0.555555556 |
Barack Obama | -0.592592593 |
New York | -0.666666667 |
Florida | -0.666666667 |
Los Angeles | -0.703703704 |
San Francisco | -0.740740741 |
Times Square | -0.740740741 |
Washington DC. | -0.814814815 |
United States of America | -0.851851852 |
URL: http://www.bbc.com/news/uk-23429644
Entity Label | Score |
---|---|
Kate Middleton Duchess of Cambridge | 0.357142857 |
King George the VI Bertie | 0.357142857 |
Prince George of Cambridge George Alexander Louis | 0.285714286 |
Prince William Duke of Cambridge | 0.285714286 |
King George V | 0.142857143 |
Queen Elizabeth | 0.071428571 |
Earl Louis Mountbatten | -0.142857143 |
Prince Harry of Wales | -0.214285714 |
Charles Prince of Wales | -0.285714286 |
Queen Victoria | -0.285714286 |
Balmoral Castle | -0.285714286 |
Princess Margaret | -0.357142857 |
Cambridge | -0.357142857 |
Queen Mary of Teck | -0.357142857 |
Kensington Palace | -0.428571429 |
The Queen Mother Elizabeth Angela Marguerite Bowes-Lyon | -0.428571429 |
Irish Republican Army | -0.428571429 |
St Mary's Hospital | -0.5 |
England Country | -0.5 |
Great Britain | -0.5 |
Suzannah Lipscomb | -0.571428571 |