Suicidal Ideation Detection based on Social Media Dataset using Semantic, Contextual and Graph Neural Network based Hybrid Approach
This project aims to develop a system that detects suicidal ideation (SI) in posts from Facebook, Twitter and Reddit using Natural Language Processing (NLP) and Deep Learning (DL) models, including Long Short-Term Memory (LSTM) networks and Graph Neural Networks (GNN). We develop two pipelines. The first is LSI-based: LSI topic modeling is performed on the data, and the output of LSI is embedded with word2vec. The original data is also embedded with Bidirectional Encoder Representations from Transformers (BERT). The concatenated word2vec and BERT embeddings are used as input to an LSTM to detect SI. The second pipeline combines the power of lexical features with a cutting-edge technique to construct a lexical, psycholinguistic knowledge-guided graph neural network model for SI detection. We employ LIWC to extract psycholinguistic features from the collected and pre-processed text data, use these features to build a k-nearest-neighbour graph, and then apply a graph neural network to that graph for SI detection. The system aims to identify individuals who may be at risk of suicide and to contribute to suicide prevention and related policy-making.
We collect a total of 785 posts, of which 386, 321 and 78 posts are from Reddit, Facebook and Twitter, respectively. We crawl and scrape data from those platforms with search keywords such as "suicide", "suicidal", "self injury", "self harm", and other related terms. The collected data is annotated by one behavioural scientist as 'YES' for suicidal and 'NO' for non-suicidal posts. In total, 405 posts are annotated as 'YES' and 380 posts as 'NO'.
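As a minimal sketch of the keyword-based collection step, the snippet below gathers candidate Reddit posts with PRAW. The credentials, the site-wide search scope, the per-keyword limit and the abbreviated keyword list are placeholders and assumptions; equivalent collectors would be needed for Facebook and Twitter, and all retrieved posts are still annotated manually.

```python
# Hedged sketch: keyword-based Reddit collection with PRAW.
# Credentials, search scope and keyword list are placeholders, not the
# project's exact collection setup.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="si-detection-research",
)

KEYWORDS = ["suicide", "suicidal", "self injury", "self harm"]

posts = []
for keyword in KEYWORDS:
    # Search site-wide; keep each submission's title and body for annotation.
    for submission in reddit.subreddit("all").search(keyword, limit=100):
        posts.append({"id": submission.id,
                      "text": f"{submission.title} {submission.selftext}"})

print(f"Collected {len(posts)} candidate posts for manual annotation.")
```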
Since the collected data can be noisy, we carefully clean the textual data before using it in the SI detection task. We pre-process the data for both pipelines. The pre-processing steps include removing irrelevant characters, stemming, lemmatization and stop-word removal. Nonsensical characters are not recognizable to the machine learning models and make the text noisy, so they must be removed to ease the classification task. Emojis, URLs, punctuation, extra white space, numerals and user references are removed from the text using regular expressions. We apply the Porter stemmer and WordNet lemmatizer from NLTK to perform stemming and lemmatization, which improves text categorization accuracy. Unimportant and frequently occurring words that carry little or no grammatical weight for text classification are identified as stop words. We use the NLTK stop-word corpus to eliminate them so the models concentrate on the relevant information.
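The following is an illustrative sketch of this pre-processing chain with regular expressions and NLTK, following the steps named above; the exact regular expressions and the order of operations are assumptions.

```python
# Sketch of the cleaning pipeline: regex removal of URLs, user references,
# numerals and punctuation, then NLTK stop-word removal, Porter stemming and
# WordNet lemmatization. Patterns and ordering are illustrative assumptions.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)               # user references
    text = re.sub(r"[^a-z\s]", " ", text)           # emojis, numerals, punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    tokens = [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]
    return " ".join(tokens)

print(clean_text("I can't take it anymore... https://example.com @friend"))
```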
In pipeline 1, we combine the strength of word semantics (LSI) with the ability to preserve long-range dependencies in text (LSTM), producing an integrated LSI-LSTM model for SI detection. We employ TF-IDF to convert the text data into vectors; applying TF-IDF before LSI ensures that the term-document matrix emphasises important words. The TF-IDF vectors are passed through LSI for topic modeling, and the output of LSI is embedded with word2vec. The original text data is embedded with BERT. In pipeline 2, we employ LIWC to extract psycholinguistic features from the collected and pre-processed text data.
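A minimal sketch of the pipeline-1 feature extraction is given below, assuming TruncatedSVD over TF-IDF vectors as the LSI step, word2vec embeddings of the top LSI topic terms as one reading of "embedding the LSI output", and bert-base-uncased [CLS] vectors for the original posts; the number of topics, embedding sizes and model names are illustrative assumptions.

```python
# Sketch: TF-IDF -> LSI topics, word2vec over topic terms, BERT over raw posts.
import numpy as np
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD            # LSI over TF-IDF vectors
from gensim.models import Word2Vec
from transformers import AutoTokenizer, AutoModel

docs = ["cleaned post about feeling hopeless", "cleaned post about a good day"]

# TF-IDF term-document matrix, then LSI topic space.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=42)
doc_topics = lsi.fit_transform(X)                          # (n_docs, n_topics)

# word2vec trained on the tokenized corpus; top LSI topic terms are embedded.
w2v = Word2Vec([d.split() for d in docs], vector_size=100, min_count=1)
terms = tfidf.get_feature_names_out()
top_terms = [terms[i] for i in lsi.components_[0].argsort()[::-1][:5]]
topic_vec = np.mean([w2v.wv[t] for t in top_terms], axis=0)  # 100-d topic embedding

# BERT [CLS] embeddings of the original posts (768-d each).
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    enc = tok(docs, padding=True, truncation=True, return_tensors="pt")
    cls_vecs = bert(**enc).last_hidden_state[:, 0, :]
```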
The concatenated word2vec and BERT embeddings are fed into the LSTM model as input for SI detection. By incorporating BERT, we include the power of contextual word embeddings from a pre-trained language model.
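A minimal Keras sketch of this classifier is shown below, assuming each post is represented as a fixed-length sequence of concatenated word2vec (100-d) and BERT (768-d) token embeddings; the sequence length, layer sizes and training hyperparameters are assumptions rather than the project's final configuration.

```python
# Sketch of the LSTM classifier over concatenated word2vec + BERT embeddings.
import tensorflow as tf

MAX_LEN = 64                 # assumed tokens per post
EMB_DIM = 100 + 768          # concatenated word2vec + BERT embedding size

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN, EMB_DIM)),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # YES / NO
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_concat, y, epochs=10, batch_size=16, validation_split=0.1)
```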
The LIWC features are used to construct a graph with the k-nearest-neighbour method, and a graph neural network is applied to this graph for SI detection.
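The sketch below illustrates this pipeline with a k-NN graph built from LIWC feature vectors and a two-layer GCN node classifier in PyTorch Geometric; the choice of k, the GCN variant, the feature dimension and the random placeholder data are assumptions for illustration only.

```python
# Sketch of pipeline 2: k-NN graph over LIWC features + two-layer GCN classifier.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.neighbors import kneighbors_graph
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

liwc = torch.randn(785, 93)              # placeholder LIWC feature matrix (posts x features)
labels = torch.randint(0, 2, (785,))     # placeholder YES (1) / NO (0) labels

# Connect each post to its k nearest neighbours in LIWC feature space.
adj = kneighbors_graph(liwc.numpy(), n_neighbors=5, mode="connectivity")
edge_index = torch.tensor(np.array(adj.nonzero()), dtype=torch.long)
data = Data(x=liwc, edge_index=edge_index, y=labels)

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden=64, classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, classes)

    def forward(self, data):
        h = F.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(h, data.edge_index)

model = GCN(liwc.size(1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(100):                  # full-batch training on the whole graph
    optimizer.zero_grad()
    loss = F.cross_entropy(model(data), data.y)
    loss.backward()
    optimizer.step()
```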