This repository contains the code for my final paper.

This study tackles the growing problem of spam text messages by distinguishing spam from non-spam texts using Natural Language Processing techniques, specifically Term Frequency-Inverse Document Frequency (TF-IDF). We analyzed a dataset containing both spam and non-spam messages to identify linguistic patterns that could aid classification. We first preprocessed the data, then applied TF-IDF to measure how important each word is within the texts. We hypothesized that terms with higher TF-IDF scores in spam messages would be indicative of spam. Logistic Regression was used to model these relationships and to validate TF-IDF scores as predictive indicators. Words related to promotional activities, such as "free" and "prize," scored significantly higher in spam messages, while conversational words dominated non-spam texts. The Logistic Regression model achieved high accuracy, supporting the reliability of TF-IDF scores for spam detection. The study reaffirms the utility of TF-IDF in spam identification and suggests combining it with other machine learning techniques to improve spam filtering systems. The findings offer insights for developing more effective anti-spam measures in digital communication.
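To make the TF-IDF step concrete, here is a minimal sketch in plain Python. It is not the code from the paper: the `tokenize` and `tf_idf` helpers, the toy messages, and the particular IDF variant (`ln(N / df) + 1`) are all assumptions chosen for illustration, and the repository's own scripts may preprocess and weight terms differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a simple stand-in for preprocessing)."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = term count / number of tokens in the document
    idf = ln(N / df) + 1   (one common variant, assumed here)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty messages
        scores.append(
            {t: (c / total) * (math.log(n_docs / df[t]) + 1.0) for t, c in counts.items()}
        )
    return scores


# Toy messages invented for illustration; they are not taken from the dataset.
messages = [
    "Free prize! Claim your free prize now",
    "Are you coming home for dinner",
    "See you at home later",
]

scores = tf_idf(messages)
# The two highest-scoring terms of the spam-like first message.
top_spam_terms = sorted(scores[0], key=scores[0].get, reverse=True)[:2]
print(top_spam_terms)
```

In this toy example the repeated promotional words in the first message receive the highest scores, while a word shared across messages, such as "you", is down-weighted. In practice, per-message TF-IDF vectors of this kind would be the features passed to a Logistic Regression classifier; scikit-learn's `TfidfVectorizer` and `LogisticRegression` are one common way to do that.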
The SMS Spam Collection dataset is a set of text messages collected for SMS spam research. It contains 5,574 English messages, each labeled 'ham' for genuine messages or 'spam' for unsolicited content. With 747 messages labeled spam and 4,827 labeled ham (about 13.4% spam), the dataset is imbalanced, which must be taken into account during data preparation so that later analyses remain valid. Each message occupies one line, and every line has two columns: 'v1', the category ('ham' or 'spam'), and 'v2', the raw text of the SMS.

The collection was assembled from several sources. It includes 425 spam messages manually extracted from Grumbletext, a public UK forum where people report unsolicited text messages. It also includes a randomly selected sample of 3,375 ham messages from the NUS SMS Corpus (NSC), a repository of around 10,000 genuine messages collected for research at the Department of Computer Science of the National University of Singapore. A further 450 ham messages come from Caroline Tagg's doctoral dissertation. Finally, the collection incorporates the SMS Spam Corpus v.0.1 Big, which contributes 1,002 ham messages and 322 spam messages.
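As a rough illustration of the two-column layout and the class imbalance described above, the sketch below counts labels with Python's standard `csv` module. The inline sample rows are invented, and the `class_counts` helper is hypothetical. The real file's name, delimiter, and encoding are not stated here, so a comma-separated file with a `v1,v2` header row is an assumption.

```python
import csv
import io
from collections import Counter

# Tiny inline stand-in for the real file; these rows are invented for illustration.
sample_csv = """v1,v2
ham,"Ok, see you at lunch"
spam,WINNER! Claim your free prize now
ham,"Running late, call you soon"
"""


def class_counts(csv_file):
    """Count how many rows carry each label in the 'v1' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["v1"] for row in reader)


counts = class_counts(io.StringIO(sample_csv))
print(counts)

# Class balance computed from the counts reported for the full dataset.
reported = {"ham": 4827, "spam": 747}
spam_share = reported["spam"] / sum(reported.values())
print(f"spam share: {spam_share:.1%}")
```

To run the same count on the actual dataset, the `io.StringIO` object could be replaced with an opened file handle. Because only about one message in seven is spam, stratified train/test splits or class weighting are typical options to consider when preparing the data.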