DRC News Corpus


The "DRC News Corpus" is a curated collection of news articles sourced from major media outlets covering a wide spectrum of topics related to the Democratic Republic of Congo (DRC). This dataset encompasses a diverse range of news stories, including but not limited to politics, economy, social issues, culture, environment, and international relations, providing comprehensive coverage of events and developments within the country.

Use Cases:

Researchers, journalists, policymakers, and data enthusiasts interested in understanding the socio-political climate, economic dynamics, and other facets of the DRC will find this dataset valuable. It serves as a resource for sentiment analysis, trend identification, language modeling, and other natural language processing (NLP) tasks.

Efforts have been made to ensure the dataset's integrity and quality by including articles from reputable news outlets. However, users are encouraged to exercise discretion and validate the information independently as journalistic standards and perspectives may vary among sources.

Sources

| Source | Supported | Articles | Link | Last Crawled |
|---|---|---|---|---|
| radiookapi.net | Yes | +100k | https://www.radiookapi.net/actualite | 2024-10-09 |
| mediacongo.net | Yes | +100k | https://www.mediacongo.net/ | 2024-10-11 |
| beto.cd | Yes | +30k | https://www.beto.cd/ | 2024-10-13 |
| actualite.cd | Yes | NA | https://actualite.cd/ | NA |
| 7sur7.cd | Yes | NA | https://7sur7.cd | NA |

Download the dataset

  • timespan : 2004-2023
  • last update : 2023-11-30

DRC News Corpus on Kaggle

Build the dataset

If you want to rebuild the dataset, follow the steps below:

Installation

git clone https://github.com/bernard-ng/drc-news-corpus.git && cd drc-news-corpus
make build
make start

Database Configuration

If you're not using Docker, configure the database connection in the .env file, then run the following command to create the database schema:

composer corpus:migrations

Usage

See the supported sources above. You can also add your own source by extending the Source abstract class. For example, to crawl radiookapi.net, run the following commands:

  1. Crawling
php bin/console app:crawl radiookapi.net

# You can specify a date range to crawl articles.
php bin/console app:crawl politico.cd --date="2022-01-01:2022-12-31"

# You can specify a page range to crawl articles.
php bin/console app:crawl mediacongo.net --page="0:6" 

# You can specify both date and page range.
php bin/console app:crawl actualite.cd --date="2022-01-01:2022-12-31" --page="0:6"

# Some sources require a category to crawl articles.
php bin/console app:crawl 7sur7.cd --category=politique

# You can crawl multiple pages in parallel.
php bin/console app:crawl radiookapi.net --parallel=20
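Since a crawl over many years can run for a long time, one option is to split it into monthly batches using the --date option shown above. The following is a minimal sketch, not part of the project: the function names and the DRY_RUN variable are hypothetical helpers, and it assumes the crawler accepts any "start:end" range in the YYYY-MM-DD format used in the examples.

```shell
# Print the last day of a month (arguments: YEAR MONTH, month as 01..12).
last_day_of_month() {
  ldom_year=$1
  case "$2" in
    04|06|09|11) echo 30 ;;
    02)
      # Leap year: divisible by 4 and not by 100, or divisible by 400.
      if [ $((ldom_year % 4)) -eq 0 ] && [ $((ldom_year % 100)) -ne 0 ] \
         || [ $((ldom_year % 400)) -eq 0 ]; then
        echo 29
      else
        echo 28
      fi
      ;;
    *) echo 31 ;;
  esac
}

# Print one "start:end" range per month of the given year.
month_ranges() {
  for mr_month in 01 02 03 04 05 06 07 08 09 10 11 12; do
    mr_last=$(last_day_of_month "$1" "$mr_month")
    echo "$1-$mr_month-01:$1-$mr_month-$mr_last"
  done
}

# Crawl a source month by month (arguments: SOURCE YEAR).
# With DRY_RUN=1 (hypothetical variable) the commands are only printed.
crawl_by_month() {
  cbm_source=$1
  for cbm_range in $(month_ranges "$2"); do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "php bin/console app:crawl $cbm_source --date=\"$cbm_range\""
    else
      php bin/console app:crawl "$cbm_source" --date="$cbm_range"
    fi
  done
}
```

For example, `DRY_RUN=1 crawl_by_month radiookapi.net 2022` prints the twelve commands without running them, which makes it easy to check the ranges before starting a real crawl.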
  2. Updating
# Update the database with the latest articles.
php bin/console app:update radiookapi.net

Note that crawling can take a while, depending on the number of articles, and that the articles are stored in the database. Running the command in the background is recommended. By default no output is generated; add the -v option to see the progress.

nohup php bin/console app:crawl radiookapi.net -v > crawling.log 2>&1 &
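When a crawl runs in the background, it helps to remember its process ID so you can check later whether it is still running. The sketch below shows one way to do that with plain shell; the helper names and the crawling.pid file are hypothetical and not part of the project.

```shell
# Start a command in the background, detached from the terminal.
# Arguments: PID_FILE LOG_FILE COMMAND [ARGS...]
start_background() {
  sb_pid_file=$1
  sb_log_file=$2
  shift 2
  # Send both standard output and errors to the log file.
  nohup "$@" > "$sb_log_file" 2>&1 &
  # $! holds the process ID of the job that was just started.
  echo $! > "$sb_pid_file"
}

# Succeed if the process recorded in the pid file is still alive.
is_running() {
  [ -f "$1" ] && kill -0 "$(cat "$1")" 2>/dev/null
}

# Example (not executed here):
#   start_background crawling.pid crawling.log php bin/console app:crawl radiookapi.net -v
#   is_running crawling.pid && tail -n 5 crawling.log
```

`kill -0` sends no signal; it only reports whether the process exists, so the check is safe to repeat as often as needed.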
  3. Statistics
# Get the number of articles in the database.
php bin/console app:stats

Export the dataset

You can export the dataset to a CSV file by running the following command:

php bin/console app:export

# You can specify a date range to export articles.
php bin/console app:export --date="2022-01-01:2022-12-31"

# You can specify a source to export articles.
php bin/console app:export --source=radiookapi.net

# You can specify both date and source.
php bin/console app:export --date="2022-01-01:2022-12-31" --source=radiookapi.net

A CSV file will be generated in the data directory.
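To take a quick look at an export, you can locate the newest CSV file and list its columns. This is only a sketch: the helper names are hypothetical, and it assumes the file starts with a header row whose column names contain no commas, which this README does not confirm.

```shell
# Print the most recently modified CSV file in a directory (default: data).
latest_export() {
  ls -t "${1:-data}"/*.csv 2>/dev/null | head -n 1
}

# Print the header columns of a CSV file, one per line.
# Assumes the first line is a header row without quoted commas.
csv_columns() {
  head -n 1 "$1" | tr ',' '\n'
}

# Example (not executed here):
#   csv_columns "$(latest_export data)"
```

Article bodies usually span several lines, so counting lines with `wc -l` is not a reliable way to count articles in the file; `php bin/console app:stats` reports the number stored in the database.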

Contributors

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Acknowledgment:

The compilation and curation of the "DRC News Corpus" were conducted by Tshabu Ngandu Bernard with the primary objective of facilitating research and analysis related to the Democratic Republic of Congo. Please cite this repository if you use the data or this software.