Problem Statement: Crawl news and information websites and anticipate the likelihood of their content going viral. (Bipolar Factory Assignment)
I have used the Online News Popularity Data Set available in the UCI machine learning repository to train my model.
Data Set Information (source):
- The articles were published by Mashable (www.mashable.com), and the rights to reproduce their content belong to Mashable. Hence, this dataset does not share the original content but only some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.
- Acquisition date: January 8, 2015
- The relative performance values were estimated by the authors using a Random Forest classifier and a rolling window as the assessment method. See their article for more details on how the relative performance values were set.
Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)
- url: URL of the article (non-predictive)
- timedelta: Days between the article publication and the dataset acquisition (non-predictive)
- n_tokens_title: Number of words in the title
- n_tokens_content: Number of words in the content
- n_unique_tokens: Rate of unique words in the content
- n_non_stop_words: Rate of non-stop words in the content
- n_non_stop_unique_tokens: Rate of unique non-stop words in the content
- num_hrefs: Number of links
- num_self_hrefs: Number of links to other articles published by Mashable
- num_imgs: Number of images
- num_videos: Number of videos
- average_token_length: Average length of the words in the content
- num_keywords: Number of keywords in the metadata
- data_channel_is_lifestyle: Is data channel 'Lifestyle'?
- data_channel_is_entertainment: Is data channel 'Entertainment'?
- data_channel_is_bus: Is data channel 'Business'?
- data_channel_is_socmed: Is data channel 'Social Media'?
- data_channel_is_tech: Is data channel 'Tech'?
- data_channel_is_world: Is data channel 'World'?
- kw_min_min: Worst keyword (min. shares)
- kw_max_min: Worst keyword (max. shares)
- kw_avg_min: Worst keyword (avg. shares)
- kw_min_max: Best keyword (min. shares)
- kw_max_max: Best keyword (max. shares)
- kw_avg_max: Best keyword (avg. shares)
- kw_min_avg: Avg. keyword (min. shares)
- kw_max_avg: Avg. keyword (max. shares)
- kw_avg_avg: Avg. keyword (avg. shares)
- self_reference_min_shares: Min. shares of referenced articles in Mashable
- self_reference_max_shares: Max. shares of referenced articles in Mashable
- self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
- weekday_is_monday: Was the article published on a Monday?
- weekday_is_tuesday: Was the article published on a Tuesday?
- weekday_is_wednesday: Was the article published on a Wednesday?
- weekday_is_thursday: Was the article published on a Thursday?
- weekday_is_friday: Was the article published on a Friday?
- weekday_is_saturday: Was the article published on a Saturday?
- weekday_is_sunday: Was the article published on a Sunday?
- is_weekend: Was the article published on the weekend?
- LDA_00: Closeness to LDA topic 0
- LDA_01: Closeness to LDA topic 1
- LDA_02: Closeness to LDA topic 2
- LDA_03: Closeness to LDA topic 3
- LDA_04: Closeness to LDA topic 4
- global_subjectivity: Text subjectivity
- global_sentiment_polarity: Text sentiment polarity
- global_rate_positive_words: Rate of positive words in the content
- global_rate_negative_words: Rate of negative words in the content
- rate_positive_words: Rate of positive words among non-neutral tokens
- rate_negative_words: Rate of negative words among non-neutral tokens
- avg_positive_polarity: Avg. polarity of positive words
- min_positive_polarity: Min. polarity of positive words
- max_positive_polarity: Max. polarity of positive words
- avg_negative_polarity: Avg. polarity of negative words
- min_negative_polarity: Min. polarity of negative words
- max_negative_polarity: Max. polarity of negative words
- title_subjectivity: Title subjectivity
- title_sentiment_polarity: Title polarity
- abs_title_subjectivity: Absolute subjectivity level
- abs_title_sentiment_polarity: Absolute polarity level
- shares: Number of shares (target)
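The attribute list above splits into two non-predictive columns (url, timedelta), the predictive columns, and the target (shares). A minimal sketch of that split with pandas is shown here. The inline sample CSV, its values, and the helper name load_and_split are assumptions made for illustration; the real file has all 61 columns, and its padded header names are imitated by the leading spaces in the sample.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV (illustrative values only).
# The header names carry leading spaces, imitating the padded column names.
SAMPLE_CSV = """url, timedelta, n_tokens_title, n_tokens_content, num_hrefs, shares
http://example.com/a, 731.0, 12.0, 219.0, 4.0, 593
http://example.com/b, 731.0, 9.0, 255.0, 3.0, 711
http://example.com/c, 730.0, 9.0, 211.0, 3.0, 1500
"""

NON_PREDICTIVE = ["url", "timedelta"]  # marked non-predictive in the attribute list
TARGET = "shares"                      # the goal field


def load_and_split(csv_source):
    """Hypothetical helper: read the CSV, clean the headers, split X and y."""
    df = pd.read_csv(csv_source)
    # Remove the extra spaces around the column names.
    df.columns = df.columns.str.strip()
    # Keep only the predictive attributes as features.
    features = df.drop(columns=NON_PREDICTIVE + [TARGET])
    target = df[TARGET]
    return features, target


features, target = load_and_split(io.StringIO(SAMPLE_CSV))
print(list(features.columns))
print(target.tolist())
```

With the real dataset, the same function could be called with the path to the downloaded CSV file instead of the in-memory sample.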
The dataset required some additional cleaning: I removed the extra spaces added to the column headers and dropped the unwanted columns. I tried applying TruncatedSVD to the scaled data, but the model did not show much improvement, so I did not use it in the final version. I trained several regression models (Linear Regression, Ridge, Lasso, and BayesianRidge) and also trained a deep learning model. I compared all the trained models on their root mean squared error and mean absolute error, and finally selected the deep learning model based on its mean absolute error. I did not use accuracy for model selection because this is a regression problem.
The neural network consists of 2 hidden layers along with the input and output layers. All inputs are scaled using MinMaxScaler before being passed to the input layer. I used sizes of 32, 32, and 64 for the input layer, hidden layer 1, and hidden layer 2 respectively, chosen based on the results on the testing data while training the model. Finally, I saved the model to an h5 file so that it can be reused later without retraining, which keeps the results consistent.
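The comparison of the regression models by mean absolute error and root mean squared error can be sketched roughly as follows. This is not the original training script: the data is synthetic, the hyperparameters (alpha values, split ratio, random seeds) are assumptions, and the deep learning model is left out so that the example only needs NumPy and scikit-learn.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption): 500 rows, 10 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X @ rng.normal(size=10)) * 100 + 1000 + rng.normal(scale=50, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate regressors named in the text; alpha values are illustrative.
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "BayesianRidge": BayesianRidge(),
}


def evaluate(models):
    """Fit each model behind a MinMaxScaler and report MAE and RMSE."""
    results = {}
    for name, model in models.items():
        # The scaler is fitted on the training split only, inside the pipeline.
        pipeline = make_pipeline(MinMaxScaler(), model)
        pipeline.fit(X_train, y_train)
        pred = pipeline.predict(X_test)
        mae = float(mean_absolute_error(y_test, pred))
        rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
        results[name] = {"mae": mae, "rmse": rmse}
    return results


results = evaluate(candidates)
# Select the model with the lowest mean absolute error.
best = min(results, key=lambda name: results[name]["mae"])
for name, scores in results.items():
    print(f"{name}: MAE={scores['mae']:.2f} RMSE={scores['rmse']:.2f}")
print("Lowest MAE:", best)
```

Wrapping the scaler and the model in one pipeline keeps the scaling parameters tied to the training split, so the test split is transformed with the same minimum and maximum values.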
To test the model on real-world data, I scraped the data needed for the model's input features from CareerAnna, a student information site that covers various technologies and exams. I used Selenium and BeautifulSoup to scrape content from the website. After scraping, the data was cleaned, a few NLP operations were applied to convert it into a form that can be passed as model input, and it was scaled. Finally, the model (.h5 file) was loaded, the data was given as input, and the number of shares/likes was predicted as a measure of the virality of the published information. An example of the predicted samples from the CareerAnna website can be viewed here.
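As an illustration of turning scraped text into model inputs, the sketch below computes a few of the dataset's text attributes from a title and a body string. The function name text_features, the regular-expression tokenizer, and the sample strings are assumptions; the exact definitions used to build the original dataset may differ, and the scraping step itself (Selenium and BeautifulSoup) is not shown.

```python
import re

# Simple word tokenizer (an assumption): runs of letters, digits, apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")


def text_features(title, content):
    """Hypothetical helper: derive a few dataset-style features from text."""
    title_tokens = TOKEN_RE.findall(title.lower())
    content_tokens = TOKEN_RE.findall(content.lower())
    n_content = len(content_tokens)
    return {
        # Number of words in the title.
        "n_tokens_title": len(title_tokens),
        # Number of words in the content.
        "n_tokens_content": n_content,
        # Share of distinct words among all content words.
        "n_unique_tokens": len(set(content_tokens)) / n_content if n_content else 0.0,
        # Mean character length of the content words.
        "average_token_length": (
            sum(len(tok) for tok in content_tokens) / n_content if n_content else 0.0
        ),
    }


sample = text_features("Top exam tips", "Plan early. Plan often. Revise daily.")
print(sample)
```

The resulting values would still need to be placed in the same column order as the training data and scaled with the scaler fitted during training before being passed to the saved model.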