The World through media

Team members: Alice Bizeul, Johan Cattin, Laure Font

Abstract

The starting point of this project was a simple question: how are we connected to the world? The answer seems obvious. Except on the specific occasions when travelling is an option, media are our main window on the rest of the globe, especially regarding foreign countries, where language and distance create a gap between us and the information. This highlights our vulnerability to any bias the media might convey. Another starting point is the current trend of questioning the reliability and integrity of the media. A few examples are listed below. Using the GDELT 2.0 database, we wish to give insights into the image of the world conveyed by the media and to highlight possible biases. This dataset focuses on conflictual events reported by media around the world since 2015. As a complementary source of information, we will also use the Uppsala Conflict Data Program, which tracks information about armed conflicts worldwide. Raising awareness of this topic through data science can sharpen our own view, if not society's, when taking in information through media.

Bibliography:

  • https://www.lemonde.fr/yemen/video/2017/07/31/yemen-comment-expliquer-que-le-conflit-soit-si-peu-mediatise_5166849_1667193.html
  • https://www.liberation.fr/direct/element/trump-accuse-les-medias-fake-news-dengendrer-la-violence_89483/
  • https://www.tehrantimes.com/news/428409/Global-media-and-Muslims-Selective-coverage-selective-outrage

Research questions

  • What are the distributions of human activity and media coverage worldwide?
  • Can media coverage be an indicator of a country's stability or state of activity? If not, can we extract different parameters that show media coverage bias?
  • Do the features of an event give an indication of the level of mediatic attention it will receive? (type of event, number of deaths in armed conflicts, level of internalization, geographic position)
  • If time allows, is this bias evenly distributed across countries? In other words, is the bias generated by media in Switzerland different from the one generated by foreign media?

Dataset

For this project we will be using two databases:

  1. The Global Database of Events, Language, and Tone (GDELT) 2.0 is a database which monitors the information provided by broadcast, print and web news worldwide. It captures information on human activity, processes it to provide different indicators, and publishes an update every 15 minutes. As its size and complexity are huge, the data is available on Google BigQuery using standard SQL. The information provided by the database includes Event IDs, Date of event, Actors (Code, Name, Country Code, Religion Code, Ethnicity Code, Type of Actor), whether the event is a root event or not, Goldstein Scale, Number of Mentions in the News, Number of Sources, Average Tone of the mentions, and Geographic Information (Country Code, Latitude, Longitude, etc.), as described in the codebook (http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf).

A second dataset provided in the GDELT 2.0 program, the Mentions table, records every new mention of an event in media worldwide and provides several pieces of information about each mention (for instance its length and tone). It makes it possible to track an event through media over time and to get indications on the type of mention.

Event types and Goldstein Scale values are among the fields coded as CAMEO indices; their corresponding values are mentioned in the tables above:

  2. The Uppsala Conflict Data Program (UCDP) is a database which provides data on organized violence since 1946. The UCDP/PRIO Armed Conflict Dataset version 18.1 provides general information on armed conflicts, whereas the UCDP Battle-Related Deaths Dataset provides an estimate of the number of deaths for each armed conflict. The information of interest to us is the location of the conflict, its date and the associated number of deaths. All of this information is provided by these datasets.

As GDELT 2.0 only contains information from February 2015 onwards, the UCDP database will be filtered in order to keep only relevant information.

This information was taken from the corresponding codebooks:

A list of internal milestones up until project milestone 2

Week 1

  1. Acquisition of the data, extensive exploration, understanding the structure
  2. Update on the questions we would like to answer

Week 2

  1. Extraction of the data of interest and data cleaning
  2. Statistical analysis of the data

Week 3

  1. Find methods to determine correlation between the different parameters we are interested in

Questions for TAs

Google BigQuery is another way to fetch the data. Which would be the better approach for accessing the data, BigQuery or the cluster? Can we make use of additional databases (e.g. the Human Development Index database) if we wish to during the project?

Milestone 2

Update on Data Acquisition, Selection & Cleaning -

The data acquisition was done through the cluster. A few statistical analyses were performed on the whole dataset to verify that we were able to handle such an amount of data. The rest of the statistical analyses in Milestone2.ipynb were run on a small portion of the data, stored locally and provided in the data folder of the repository. The sampled files are spread over time; the first and last files of the database were also included in this sample.

For data selection, we decided to narrow down the amount of data by selecting specific columns in the export.csv and mentions.csv datasets. The GKG.csv files were left aside as they did not convey relevant information for us, in addition to being very heavy.

In the export files, which contain event-related information, we selected:

  • GlobalEventID: the only way to connect events stored in the export.csv and mentions.csv files
  • DayDate: stores the day, month and year of occurrence of the event
  • Month_YearDate: stores the month and year of occurrence of the event
  • Year_Date: stores the year of occurrence of the event
  • Fraction_Date
  • QuadClass: stores the quad class label of each event; the categories are Verbal Cooperation, Material Cooperation, Verbal Conflict and Material Conflict; kept because it gives an event type classification
  • GoldsteinScale: stores the potential impact that an event will have on a country's stability; it is directly related to the type of event
  • EventRootCode: stores the type of event; kept because it allows events to be categorised by sub-category
  • ActionGeo_CountryCode: stores the country code of the location where the event took place

In the mentions files, which contain information on news mentions, we selected:

  • GlobalEventID
  • EventTimeDate: stores the date and time of occurrence of the event in timestamp format
  • MentionTimeDate: stores the date and time of this particular news mention of the event in timestamp format
  • MentionType: type of source from which the mention comes (Web Sources, Offline Sources, Archives, Non-Written Content); allows us to assess the origin of our dataset and potentially better understand the results
  • Confidence: index of confidence in the extraction of information from the news report; the higher the index, the more we can trust the analysis of the mention that is provided; allows us to have a better understanding of the dataset
  • MentionSourceName: name of the news source from which the mention comes; allows us to have a better understanding of the origin of our dataset
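
As a rough illustration of how the two selections fit together, the sketch below links mentions to events through GlobalEventID and counts the mentions each event received. The sample rows and the helper name `count_mentions_per_event` are hypothetical, and plain dictionaries stand in for rows of the export.csv and mentions.csv files; it is not the code used in the notebook.

```python
from collections import Counter


def count_mentions_per_event(events, mentions):
    """Return {GlobalEventID: number of mentions} for every known event."""
    counts = Counter(m["GlobalEventID"] for m in mentions)
    # Events without any mention get an explicit count of 0;
    # mentions pointing at unknown events are ignored.
    return {e["GlobalEventID"]: counts.get(e["GlobalEventID"], 0) for e in events}


# Made-up sample rows, for illustration only.
events = [
    {"GlobalEventID": 1, "QuadClass": 4, "GoldsteinScale": -10.0},
    {"GlobalEventID": 2, "QuadClass": 1, "GoldsteinScale": 3.0},
    {"GlobalEventID": 3, "QuadClass": 2, "GoldsteinScale": 7.0},
]
mentions = [
    {"GlobalEventID": 1, "MentionSourceName": "a.example"},
    {"GlobalEventID": 1, "MentionSourceName": "b.example"},
    {"GlobalEventID": 2, "MentionSourceName": "a.example"},
    {"GlobalEventID": 99, "MentionSourceName": "c.example"},  # unknown event
]

mentions_per_event = count_mentions_per_event(events, mentions)
print(mentions_per_event)
```

On the full dataset the same grouping would be done on the cluster rather than in memory; the point here is only that GlobalEventID is the join key.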

Regarding data cleaning, the data was already in a satisfactory format. The only major cleaning concerned ActionGeo_CountryCode. Some country codes were not in the FIPS_ISO format, which is the format used in the json files that will be used to visualise our results on maps. All country codes are now in the correct format.
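
A minimal sketch of this kind of country code clean-up is given below. It assumes the mismatch is between FIPS 10-4 codes and ISO 3166-1 alpha-2 codes and that the target is ISO; both the direction of the conversion and the deliberately partial lookup table `FIPS_TO_ISO` are assumptions made for illustration, not the mapping used in the notebook.

```python
# Hypothetical, deliberately partial lookup from FIPS 10-4 to ISO 3166-1 alpha-2.
FIPS_TO_ISO = {
    "SZ": "CH",  # Switzerland
    "GM": "DE",  # Germany
    "UK": "GB",  # United Kingdom
    "AS": "AU",  # Australia
}


def normalise_country_code(code, mapping=FIPS_TO_ISO):
    """Strip and upper-case a code, then translate it when a mapping exists."""
    cleaned = code.strip().upper()
    # Codes absent from the lookup are passed through unchanged.
    return mapping.get(cleaned, cleaned)


raw_codes = [" sz", "US", "uk", "FR"]
clean_codes = [normalise_country_code(c) for c in raw_codes]
print(clean_codes)
```

Passing unknown codes through unchanged keeps the function safe for codes that are identical in both standards, at the cost of silently keeping codes that are simply missing from the table.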

Update on Data Statistical Analysis -

The exploration and analysis performed in the Milestone2.ipynb file followed this pipeline:

  • Data Fetching, Selection & Cleaning
  • Analysis of Origin of Data: analysis of sources, source types and confidence level for a better understanding of the data we will be working with
  • Analysis of Human Activity and Mediatic Activity through TIME
  • Analysis of Mentions DataBase and definition of Mediatic Coverage and Mediatic Attention
  • Analysis of Human Activity and Mediatic Activity through SPACE (by country)
  • Analysis of Human Activity and Mediatic Activity through EVENT TYPES
  • Analysis of Human Activity and Mediatic Activity through TIME, SPACE & EVENT TYPES

All of these analyses give insights into the behaviour of our data and form the basis of our future analysis, which will be centred around the dependencies between media attention & coverage and time, space and event types.

Update on research questions -

  • What are the distributions of human activity and media coverage worldwide? - This question has already been answered in the statistical analysis; the next step is to provide an interactive visualisation of these activities across time and space using folium maps. A one-month time step will be used to allow the user to scroll through time.

  • Can media coverage be an indicator of a country's stability or state of activity? If not, can we extract different parameters that show media coverage bias? - A country's stability can be assessed using the Goldstein Scale, which provides a numerical indication of the potential impact an event will have on the country's stability. This question will be answered by linking each event in the database to its Goldstein Scale value and to the amount of mediatic attention it received in a fixed period following its occurrence. A statistical analysis using correlation coefficients will then be performed and visualisations will be added.

  • Do the features of an event give an indication of the level of mediatic attention it will receive? (type of event, number of deaths in armed conflicts, level of internalization, geographic position) - We decided to leave the UCDP database aside as it provides data per year, which isn't relevant for the time span we will be assessing (2 years). No analysis will therefore be performed on the precise number of deaths. However, a similar analysis can be performed by extracting similar information from the occurrence of violent events. This information is extracted from the EventRootCode index. The bias of media towards geographic position will be assessed by comparing the level of human activity and the level of mediatic attention received by various countries across the globe. The impact of time on the mediatic attention received by 'major events' can also be assessed by visualizing the evolution of the mediatic attention an event receives over time, in other words by characterizing the lifespan of events in the mediatic sphere.

  • If time allows, is this bias evenly distributed across countries? In other words, is the bias generated by media in Switzerland different from the one generated by foreign media? - This step might be slightly too complicated to perform in the time we have ahead of us. In addition, this question requires that we identify the origin of each news report. This information is stored in the URL of the news mention source, and most sources are labelled as international sources, making it difficult to accurately determine the distribution of the origin of our data.
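
The "fixed period following its occurrence" and the "lifespan" of an event mentioned above can be sketched as follows. The 7-day window, the helper name `attention_and_lifespan` and the sample timestamps are assumptions chosen for illustration; the README does not state which window length was used.

```python
from datetime import datetime, timedelta


def attention_and_lifespan(event_time, mention_times, window_days=7):
    """Return (mentions inside the window, delay between event and last mention)."""
    window_end = event_time + timedelta(days=window_days)
    # Mediatic attention: mentions published within the fixed window.
    in_window = [t for t in mention_times if event_time <= t <= window_end]
    # Lifespan: time elapsed until the last recorded mention (zero if none).
    lifespan = max(mention_times) - event_time if mention_times else timedelta(0)
    return len(in_window), lifespan


# Made-up timestamps standing in for EventTimeDate and MentionTimeDate values.
event_time = datetime(2016, 3, 1, 12, 0)
mention_times = [
    datetime(2016, 3, 1, 13, 0),
    datetime(2016, 3, 3, 9, 0),
    datetime(2016, 3, 20, 12, 0),  # falls outside the 7-day window
]

attention, lifespan = attention_and_lifespan(event_time, mention_times)
print(attention, lifespan)
```

Using a fixed window makes events comparable regardless of how long ago they occurred, whereas the lifespan captures how long an event stays in the news.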

Next steps

  • Set up interactive maps to visualize across time and space the evolution of human activity and mediatic coverage based on the data provided by the GDELT 2.0 database.
  • Run statistical correlation studies between Goldstein Scale and Mediatic Attention using Correlation Coefficients
  • Set up the visualisation of the distribution of the mediatic attention of events across time, location of the event or type of event.
  • Set up the visualisations of the evolution of Human Activity and Mediatic Attention across time for specific relevant countries (US, Syria, Pakistan and Australia) as well as the evolution of the number of relevant event types during the same time span (violent vs. peaceful events).
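
Both the maps and the per-country time series rely on counts aggregated per month and per country. A minimal sketch of that aggregation is shown below; it assumes Month_YearDate is stored as a YYYYMM integer, and the helper name `monthly_counts` and the sample rows are hypothetical.

```python
from collections import defaultdict


def monthly_counts(events):
    """Count events per (Month_YearDate, ActionGeo_CountryCode) pair."""
    counts = defaultdict(int)
    for event in events:
        key = (event["Month_YearDate"], event["ActionGeo_CountryCode"])
        counts[key] += 1
    return dict(counts)


# Made-up sample rows; Month_YearDate is assumed to be a YYYYMM integer.
events = [
    {"Month_YearDate": 201603, "ActionGeo_CountryCode": "US"},
    {"Month_YearDate": 201603, "ActionGeo_CountryCode": "US"},
    {"Month_YearDate": 201603, "ActionGeo_CountryCode": "SY"},
    {"Month_YearDate": 201604, "ActionGeo_CountryCode": "US"},
]

events_per_month = monthly_counts(events)
print(events_per_month)
```

One table per month can then be passed to the mapping tool, which matches the one-month time step planned for the interactive maps.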

Milestone 3

Results and visualisations of our project are available in our data story: https://lovelacecrewproject.github.io/ All material used to generate those results, as well as the Milestone 2 analysis, is available in our Final_notebook.ipynb and in the main.py file, which was used to compute results on the full database via the cluster. Raw results are stored in the data folder.

Update since Milestone 2

This last part of the project was centred around generating results on the full GDELT 2.0 dataset provided via the cluster and building the appropriate visualisations from this data, according to the points that we wished to explore. Finally, the data story (https://lovelacecrewproject.github.io/) was created to summarize our work and the results obtained.

Regarding the next steps planned at Milestone 2:

  • Set up interactive maps to visualize across time and space the evolution of human activity and mediatic coverage based on the data provided by the GDELT 2.0 database: several tools were tested to generate those maps (folium, plotly, viz, ...); after several unsuccessful attempts, our final choice was to use folium to give an overview of the reality conveyed by the dataset.
  • Run statistical correlation studies between Goldstein Scale and Mediatic Attention using Correlation Coefficients: the dataset was restricted to a smaller time period to generate those results; the Pearson and Spearman correlation coefficients were computed to draw our final conclusion regarding the type of relationship between the Goldstein stability index and the coverage received by an event.
  • Set up the visualisation of the distribution of the mediatic attention of events across time, location of the event or type of event: these distributions were combined for 4 countries of interest in order for us to
  • Set up the visualisations of the evolution of Human Activity and Mediatic Attention across time for specific relevant countries (US, Syria, Pakistan and Australia) as well as the evolution of the number of relevant event types during the same time span (violent vs. peaceful events).
  • If time allows, is this bias evenly distributed across countries? In other words, is the bias generated by media in Switzerland different from the one generated by foreign media?: This was a question we had set aside by the end of Milestone 2 for time reasons. In the end, we decided to generate results for major news sources which aim to relay international and national information in different major countries (France, United Kingdom, United States of America, Indonesia, Israel, Kenya, Australia, Russia, China, Japan). This gave us insights into the bias generated by a media source and its location. These sources all aim to convey, in addition to an overview of the national context, an overview of what happens in the world. Despite this common stated goal, results show that the vision of the world conveyed by major local media channels differs depending on the media source. This is most probably due to the media source itself and the vision it wishes to convey, as well as to the geographic location of the media source, which influences its content.
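
For reference, the Pearson and Spearman coefficients mentioned in the correlation step above can be sketched in plain Python as follows. The sample values are made up and are not project results; the function names are hypothetical, and Spearman is computed here as the Pearson correlation of average ranks, which is one standard way to handle ties.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values: Goldstein Scale and number of mentions for five events.
goldstein = [-10.0, -5.0, 0.0, 3.0, 7.0]
n_mentions = [120, 40, 10, 8, 2]

r_pearson = pearson(goldstein, n_mentions)
r_spearman = spearman(goldstein, n_mentions)
print(r_pearson, r_spearman)
```

Comparing the two coefficients helps distinguish a linear relationship (Pearson) from one that is merely monotonic (Spearman).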

Work distribution

  • Laure: correlation analysis, creation of the textual content, set-up of visualisations, oral presentation
  • Alice: creation of the data story, set-up of visualisations, creation of the README, oral presentation
  • Johan: handling the cluster and generation of results on the full database, creation of the poster