Why analyze the popularity of online content? In the digital era, everyone is competing for attention. Our goal is to analyze the relationship between the features of online content and its popularity.
NYT data is well suited to this task. Every NYT article is posted on NYT's Twitter account as a separate tweet, which makes it possible to measure the popularity of each article by the number of Likes, Retweets, and Comments on its tweet.
The idea was inspired by the following paper: "What Makes Online Content Viral?" (Berger and Milkman, 2012).
- The GetOldTweets3 package was used to scrape basic information about each article and its corresponding popularity level from the @nytimes Twitter account.
- A hand-written web scraper was built to extract features from each article. It relies mostly on regular expressions and BeautifulSoup4.
- Then we performed feature engineering to obtain features such as the sentiment polarity (positive vs. negative) of each article.
- Lastly, we examined the relationship between the gathered features and popularity, measured by the number of Likes, Retweets, and Comments on each tweet.
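
The last two steps above can be illustrated with a minimal sketch. It is not the project's actual code: the word lists, the `extract_features` and `pearson` helpers, and the sample records are all hypothetical, and only the standard library is used so that the example runs without BeautifulSoup4 or a sentiment library.

```python
import re
from math import sqrt

# Hypothetical toy lexicons (assumption); a real pipeline would likely
# use a dedicated sentiment library instead.
POSITIVE = {"good", "great", "win", "happy", "love"}
NEGATIVE = {"bad", "crisis", "loss", "sad", "fear"}


def extract_features(text):
    """Return word count and a lexicon-based polarity score in [-1, 1]."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    # Polarity is 0.0 when no lexicon word appears in the text.
    polarity = (pos - neg) / (pos + neg) if pos + neg else 0.0
    return {"word_count": len(words), "polarity": polarity}


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# Made-up records standing in for scraped articles and their tweet counts.
articles = [
    {"text": "A great win makes fans happy", "likes": 120},
    {"text": "Markets fear a bad crisis", "likes": 300},
    {"text": "City council meets on Tuesday", "likes": 20},
]

polarities = [extract_features(a["text"])["polarity"] for a in articles]
likes = [a["likes"] for a in articles]
print("polarity vs. likes correlation:", pearson(polarities, likes))
```

The same `pearson` call could be repeated with Retweets or Comments as the popularity measure; with only three made-up records the resulting number carries no meaning beyond showing the mechanics.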
Approximately 100 articles are released each day. We initially analyzed data from 2016-04-01 through 2016-07-01.
To be organized.
- PM: Elaine Pak (member of Data Mining Center, Seoul National University)
- Interns: Sunbin Kwon, Hyeonjin Kim, Jaehyeon Nam, Yongjae Lee, Jaesung Lee, Hanyong Lee