OnlineNewsPopularity

Problem Statement: Crawl news and information websites and anticipate the likelihood of their content going viral. (Bipolar Factory Assignment)

Dataset Used for Training :

I have used the Online News Popularity Data Set available in the UCI machine learning repository to train my model.

Data Set Information(source):

The articles were published by Mashable (www.mashable.com), and the rights to reproduce their content belong to Mashable. Hence, this dataset does not share the original content but only some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.
Acquisition date: January 8, 2015
The estimated relative performance values were estimated by the authors using a Random Forest classifier and a rolling window as the assessment method. See their article for more details on how the relative performance values were set.

Attribute Information:

Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)

url: URL of the article (non-predictive)
timedelta: Days between the article publication and the dataset acquisition (non-predictive)
n_tokens_title: Number of words in the title
n_tokens_content: Number of words in the content
n_unique_tokens: Rate of unique words in the content
n_non_stop_words: Rate of non-stop words in the content
n_non_stop_unique_tokens: Rate of unique non-stop words in the content
num_hrefs: Number of links
num_self_hrefs: Number of links to other articles published by Mashable
num_imgs: Number of images
num_videos: Number of videos
average_token_length: Average length of the words in the content
num_keywords: Number of keywords in the metadata
data_channel_is_lifestyle: Is data channel 'Lifestyle'?
data_channel_is_entertainment: Is data channel 'Entertainment'?
data_channel_is_bus: Is data channel 'Business'?
data_channel_is_socmed: Is data channel 'Social Media'?
data_channel_is_tech: Is data channel 'Tech'?
data_channel_is_world: Is data channel 'World'?
kw_min_min: Worst keyword (min. shares)
kw_max_min: Worst keyword (max. shares)
kw_avg_min: Worst keyword (avg. shares)
kw_min_max: Best keyword (min. shares)
kw_max_max: Best keyword (max. shares)
kw_avg_max: Best keyword (avg. shares)
kw_min_avg: Avg. keyword (min. shares)
kw_max_avg: Avg. keyword (max. shares)
kw_avg_avg: Avg. keyword (avg. shares)
self_reference_min_shares: Min. shares of referenced articles in Mashable
self_reference_max_shares: Max. shares of referenced articles in Mashable
self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
weekday_is_monday: Was the article published on a Monday?
weekday_is_tuesday: Was the article published on a Tuesday?
weekday_is_wednesday: Was the article published on a Wednesday?
weekday_is_thursday: Was the article published on a Thursday?
weekday_is_friday: Was the article published on a Friday?
weekday_is_saturday: Was the article published on a Saturday?
weekday_is_sunday: Was the article published on a Sunday?
is_weekend: Was the article published on the weekend?
LDA_00: Closeness to LDA topic 0
LDA_01: Closeness to LDA topic 1
LDA_02: Closeness to LDA topic 2
LDA_03: Closeness to LDA topic 3
LDA_04: Closeness to LDA topic 4
global_subjectivity: Text subjectivity
global_sentiment_polarity: Text sentiment polarity
global_rate_positive_words: Rate of positive words in the content
global_rate_negative_words: Rate of negative words in the content
rate_positive_words: Rate of positive words among non-neutral tokens
rate_negative_words: Rate of negative words among non-neutral tokens
avg_positive_polarity: Avg. polarity of positive words
min_positive_polarity: Min. polarity of positive words
max_positive_polarity: Max. polarity of positive words
avg_negative_polarity: Avg. polarity of negative words
min_negative_polarity: Min. polarity of negative words
max_negative_polarity: Max. polarity of negative words
title_subjectivity: Title subjectivity
title_sentiment_polarity: Title polarity
abs_title_subjectivity: Absolute subjectivity level
abs_title_sentiment_polarity: Absolute polarity level
shares: Number of shares (target)
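
The attribute list above separates into two non-predictive columns (url, timedelta), the predictive attributes, and the target (shares). The following is a minimal sketch of that split, assuming the CSV headers carry stray spaces after the commas as mentioned in the methodology section. The helper names `load_rows` and `split_features_target` are hypothetical, and the sample rows are made-up values covering only a few of the 61 columns.

```python
import csv
import io

NON_PREDICTIVE = ["url", "timedelta"]
TARGET = "shares"


def load_rows(handle):
    """Read the CSV and strip stray whitespace from headers and values."""
    reader = csv.reader(handle)
    header = [name.strip() for name in next(reader)]
    return [dict(zip(header, (value.strip() for value in row))) for row in reader]


def split_features_target(rows):
    """Drop the non-predictive columns and separate the target from the predictors."""
    feature_names = [
        name for name in rows[0] if name not in NON_PREDICTIVE and name != TARGET
    ]
    X = [[float(row[name]) for name in feature_names] for row in rows]
    y = [float(row[TARGET]) for row in rows]
    return feature_names, X, y


# Tiny illustrative sample (made-up values, not rows from the real dataset).
sample = io.StringIO(
    "url, timedelta, n_tokens_title, num_hrefs, shares\n"
    "http://example.com/a, 731.0, 12.0, 4.0, 593\n"
    "http://example.com/b, 730.0, 9.0, 3.0, 711\n"
)
rows = load_rows(sample)
feature_names, X, y = split_features_target(rows)
print(feature_names)
```

With the full file, the same split would be applied to OnlineNewsPopularity.csv instead of the in-memory sample.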

Methodology Used for training the model

The dataset required some additional cleaning: I removed the extra spaces in the column headers and dropped the unwanted columns. I tried applying TruncatedSVD to the scaled data, but the model did not improve much, so I left it out of the final version.

I trained several regression models (Linear Regression, Ridge, Lasso, and BayesianRidge) and also trained a deep learning model. I compared all the trained models on root mean squared error and mean absolute error, and finally selected the deep learning model based on mean absolute error. I did not use accuracy for selection because this is a regression problem.

The neural network consists of 2 hidden layers along with the input and output layers. All inputs are scaled using MinMaxScaler and passed into the input layer. I used batch sizes of 32, 32, and 64 for the input layer, hidden layer 1, and hidden layer 2 respectively, chosen based on the results I got on the test data while training the model. Finally, I saved the model to an .h5 file so that it can be reused later without retraining, which keeps the results consistent.
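
The scaling and the selection metrics described above can be written out in a few lines. This is a dependency-free sketch, not the project's actual training code: in practice scikit-learn's `MinMaxScaler` and `mean_absolute_error` would do the same job. The function `select_by_mae`, the model names, and all numbers are hypothetical and chosen only to illustrate the comparison.

```python
import math


def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range (constant columns map to 0)."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [0.0 if span == 0 else (value - lo) / span for value in column]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def select_by_mae(y_true, predictions):
    """Score every model as (MAE, RMSE) and return the name with the lowest MAE."""
    scores = {
        name: (mean_absolute_error(y_true, pred), root_mean_squared_error(y_true, pred))
        for name, pred in predictions.items()
    }
    best = min(scores, key=lambda name: scores[name][0])
    return best, scores


# Illustrative numbers only, not results from this project.
scaled = min_max_scale([2.0, 4.0, 6.0])
y_test = [100.0, 200.0, 300.0, 400.0]
predictions = {
    "model_a": [110.0, 190.0, 310.0, 390.0],
    "model_b": [100.0, 200.0, 300.0, 460.0],
}
best, scores = select_by_mae(y_test, predictions)
print(best, scores)
```

Here model_b is exact on three samples but far off on one, so it has the higher MAE and the far higher RMSE. RMSE penalizes large misses more heavily than MAE, which is why the two metrics can rank models differently on skewed targets such as share counts.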

Methodology used for scraping the data

To test my model on real-world data, I scraped the data needed for the model's input features from CareerAnna, a student information site that covers various technologies and exams. I used Selenium and BeautifulSoup to scrape content from the website. After scraping, the data was cleaned, a few NLP operations were performed on it, and it was scaled so that it could be passed as model input. Finally, the model (.h5 file) was loaded, the data was given as input, and the number of shares/likes was predicted as a measure of the virality of the published information. An example of the predicted samples from the CareerAnna website can be viewed here.
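
As a rough illustration of turning a scraped page into a few of the dataset's features, the sketch below counts links, images, and videos and tokenizes the title and body. It uses the standard library's `html.parser` in place of Selenium and BeautifulSoup so that it runs without extra dependencies. The class `ArticleFeatureParser`, the choice of `<h1>` as the title, the plain whitespace tokenization, and the sample page are all assumptions, not the logic in scrapingscript.py.

```python
from html.parser import HTMLParser


class ArticleFeatureParser(HTMLParser):
    """Count links, images, and videos, and collect title and body text."""

    def __init__(self):
        super().__init__()
        self.counts = {"num_hrefs": 0, "num_imgs": 0, "num_videos": 0}
        self._in_title = False
        self.title_parts = []
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.counts["num_hrefs"] += 1
        elif tag == "img":
            self.counts["num_imgs"] += 1
        elif tag == "video":
            self.counts["num_videos"] += 1
        elif tag == "h1":  # assumption: the article title lives in <h1>
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False

    def handle_data(self, data):
        target = self.title_parts if self._in_title else self.body_parts
        target.append(data)


def extract_features(html):
    """Return a dict of feature values named after the dataset's columns."""
    parser = ArticleFeatureParser()
    parser.feed(html)
    parser.close()
    title_tokens = " ".join(parser.title_parts).split()
    body_tokens = " ".join(parser.body_parts).split()
    unique_tokens = {token.lower() for token in body_tokens}
    features = dict(parser.counts)
    features["n_tokens_title"] = len(title_tokens)
    features["n_tokens_content"] = len(body_tokens)
    features["n_unique_tokens"] = (
        len(unique_tokens) / len(body_tokens) if body_tokens else 0.0
    )
    return features


# Made-up page used only to exercise the parser.
page = """
<html><body>
<h1>Exam dates announced</h1>
<p>The exam dates are out. See <a href="/dates">the dates</a> here.</p>
<img src="banner.png">
</body></html>
"""
features = extract_features(page)
print(features)
```

The remaining inputs (keyword statistics, LDA topics, sentiment polarity, and so on) need the NLP steps mentioned above and are not covered by this sketch.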

