Predicting Virality of content from News and Informational websites

OnlineNewsPopularity

Problem Statement: Crawl news and information websites and anticipate the likelihood of their content going viral. (Bipolar Factory Assignment)

Dataset Used for Training:

I used the Online News Popularity Data Set from the UCI Machine Learning Repository to train my model.

Data Set Information (source):

  • The articles were published by Mashable (www.mashable.com), and the rights to reproduce their content belong to them. Hence, this dataset does not share the original content, only some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.
  • Acquisition date: January 8, 2015
  • The relative performance values were estimated by the authors using a Random Forest classifier with a rolling-window assessment method. See their article for more details on how the relative performance values were set.
Attribute Information:

Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)

  1. url: URL of the article (non-predictive)
  2. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
  3. n_tokens_title: Number of words in the title
  4. n_tokens_content: Number of words in the content
  5. n_unique_tokens: Rate of unique words in the content
  6. n_non_stop_words: Rate of non-stop words in the content
  7. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
  8. num_hrefs: Number of links
  9. num_self_hrefs: Number of links to other articles published by Mashable
  10. num_imgs: Number of images
  11. num_videos: Number of videos
  12. average_token_length: Average length of the words in the content
  13. num_keywords: Number of keywords in the metadata
  14. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
  15. data_channel_is_entertainment: Is data channel 'Entertainment'?
  16. data_channel_is_bus: Is data channel 'Business'?
  17. data_channel_is_socmed: Is data channel 'Social Media'?
  18. data_channel_is_tech: Is data channel 'Tech'?
  19. data_channel_is_world: Is data channel 'World'?
  20. kw_min_min: Worst keyword (min. shares)
  21. kw_max_min: Worst keyword (max. shares)
  22. kw_avg_min: Worst keyword (avg. shares)
  23. kw_min_max: Best keyword (min. shares)
  24. kw_max_max: Best keyword (max. shares)
  25. kw_avg_max: Best keyword (avg. shares)
  26. kw_min_avg: Avg. keyword (min. shares)
  27. kw_max_avg: Avg. keyword (max. shares)
  28. kw_avg_avg: Avg. keyword (avg. shares)
  29. self_reference_min_shares: Min. shares of referenced articles in Mashable
  30. self_reference_max_shares: Max. shares of referenced articles in Mashable
  31. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
  32. weekday_is_monday: Was the article published on a Monday?
  33. weekday_is_tuesday: Was the article published on a Tuesday?
  34. weekday_is_wednesday: Was the article published on a Wednesday?
  35. weekday_is_thursday: Was the article published on a Thursday?
  36. weekday_is_friday: Was the article published on a Friday?
  37. weekday_is_saturday: Was the article published on a Saturday?
  38. weekday_is_sunday: Was the article published on a Sunday?
  39. is_weekend: Was the article published on the weekend?
  40. LDA_00: Closeness to LDA topic 0
  41. LDA_01: Closeness to LDA topic 1
  42. LDA_02: Closeness to LDA topic 2
  43. LDA_03: Closeness to LDA topic 3
  44. LDA_04: Closeness to LDA topic 4
  45. global_subjectivity: Text subjectivity
  46. global_sentiment_polarity: Text sentiment polarity
  47. global_rate_positive_words: Rate of positive words in the content
  48. global_rate_negative_words: Rate of negative words in the content
  49. rate_positive_words: Rate of positive words among non-neutral tokens
  50. rate_negative_words: Rate of negative words among non-neutral tokens
  51. avg_positive_polarity: Avg. polarity of positive words
  52. min_positive_polarity: Min. polarity of positive words
  53. max_positive_polarity: Max. polarity of positive words
  54. avg_negative_polarity: Avg. polarity of negative words
  55. min_negative_polarity: Min. polarity of negative words
  56. max_negative_polarity: Max. polarity of negative words
  57. title_subjectivity: Title subjectivity
  58. title_sentiment_polarity: Title polarity
  59. abs_title_subjectivity: Absolute subjectivity level
  60. abs_title_sentiment_polarity: Absolute polarity level
  61. shares: Number of shares (target)
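
As a minimal sketch of working with this attribute layout (assuming the OnlineNewsPopularity.csv file from the UCI archive sits in the working directory), the two non-predictive fields and the target can be split off from the 58 predictive features like this:

```python
# Minimal loading sketch; assumes OnlineNewsPopularity.csv from the
# UCI archive is in the working directory.
import pandas as pd

df = pd.read_csv("OnlineNewsPopularity.csv")

# The raw CSV headers carry extra leading spaces (e.g. " timedelta"),
# so strip them before selecting columns by name.
df.columns = df.columns.str.strip()

# Separate the 2 non-predictive attributes and the goal field
# from the 58 predictive features.
X = df.drop(columns=["url", "timedelta", "shares"])
y = df["shares"]

print(X.shape)  # (n_samples, 58)
```
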
Methodology Used for Training the Model

The dataset required some additional cleaning, which was done by stripping the extra spaces from the column headers and dropping the unwanted (non-predictive) columns. I tried applying TruncatedSVD to the scaled data, but the model didn't show much improvement, so I left it out of the final version. I trained several machine learning regression models, such as Linear Regression, Ridge, Lasso and BayesianRidge, and also trained a deep learning model. I compared all the trained models on their root mean squared error and mean absolute error and finally selected the deep learning model based on mean absolute error; accuracy is not a meaningful selection metric here since this is a regression problem. The neural network consists of 2 hidden layers along with the input and output layers. All inputs are scaled using MinMaxScaler before being passed to the input layer. I used widths of 32, 32 and 64 units for the input layer, hidden layer 1 and hidden layer 2 respectively, chosen from the results I got on the test data while training. Finally, I saved the model to an .h5 file so it can be reused later without retraining, for consistency of results.
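
The sketch below illustrates that pipeline end to end, continuing from the loading snippet above (it reuses X and y). The 32/32/64 layer widths follow the figures quoted above; the train/test split, optimizer and epoch count are illustrative assumptions, not values taken from this repo.

```python
# Training sketch, continuing from the loading snippet above (X, y).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, BayesianRidge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from tensorflow import keras

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale every input to [0, 1] before it reaches any model.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Baseline regressors, compared on MAE and RMSE.
for model in (LinearRegression(), Ridge(), Lasso(), BayesianRidge()):
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    mae = mean_absolute_error(y_test, pred)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{type(model).__name__}: MAE={mae:.1f} RMSE={rmse:.1f}")

# Neural network: input layer plus two hidden layers, then a single
# regression output for the predicted number of shares.
net = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1],)),
    keras.layers.Dense(32, activation="relu"),  # "input" layer (32 units)
    keras.layers.Dense(32, activation="relu"),  # hidden layer 1
    keras.layers.Dense(64, activation="relu"),  # hidden layer 2
    keras.layers.Dense(1),                      # output: predicted shares
])
net.compile(optimizer="adam", loss="mae")
net.fit(X_train_s, y_train, epochs=50, batch_size=32,
        validation_data=(X_test_s, y_test))

# Persist the model so it can be reused without retraining.
net.save("model.h5")
```

Saving via net.save keeps the architecture and learned weights together in one file, so later inference only needs keras.models.load_model.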

Methodology Used for Scraping the Data

In order to test my model on real-world data, I scraped the required features for the model input from CareerAnna, a student information site providing information about various technologies and exams. I used Selenium and BeautifulSoup for scraping content from the website. After scraping, the data was cleaned and a few NLP operations were applied so that it could be passed as model input, along with scaling of the data. Finally, the saved model (.h5 file) was loaded, the data was fed in, and the number of shares/likes was predicted, which indicates the virality of the published information. An example of the predicted samples from the CareerAnna website can be viewed here.
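
A hypothetical sketch of that scoring step follows, continuing from the training snippet above (it reuses the fitted scaler and X.columns). The CareerAnna URL and the HTML selectors are placeholders, and most of the 58 features are left at 0.0 here; in practice they would come from the same cleaning and NLP pipeline used on the training data.

```python
# Hypothetical scraping-and-scoring sketch; URL and selectors are placeholders.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from tensorflow import keras

# Render the page with Selenium, then hand the HTML to BeautifulSoup.
driver = webdriver.Chrome()
driver.get("https://www.careeranna.com/articles/example-article/")  # placeholder URL
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "html.parser")
title_tag = soup.find("h1")
title = title_tag.get_text(strip=True) if title_tag else ""
body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

# Start from an all-zero row in training-column order, then fill in the
# features that can be counted directly from the page. The remaining
# features (sentiment, LDA closeness, keyword statistics, ...) stay at
# 0.0 purely as placeholders for the full NLP pipeline.
row = pd.DataFrame([{c: 0.0 for c in X.columns}])
row["n_tokens_title"] = len(title.split())
row["n_tokens_content"] = len(body.split())
row["num_hrefs"] = len(soup.find_all("a"))
row["num_imgs"] = len(soup.find_all("img"))
row["num_videos"] = len(soup.find_all("video"))

# Scale with the scaler fitted on the training data, then score with
# the saved model.
x = scaler.transform(row)
shares = keras.models.load_model("model.h5").predict(x)
print(f"Predicted shares: {shares[0, 0]:.0f}")
```
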
