In similarity_sentence_document.ipynb we first split the document into sentences and then convert it into a matrix of token counts: each row represents a sentence, each column represents a word, and each value is the count of that word in that sentence. The total number of columns is the total number of distinct words in the document after cleaning. We then use cosine similarity to build a similarity matrix between sentences, computed from the normalized dot products of their rows in the count matrix.
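A minimal sketch of this pipeline, assuming scikit-learn and a naive punctuation-based sentence split (the notebook's exact code and cleaning steps may differ):

import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

document = ("Text summarization is useful. Summarization shortens long text. "
            "Cats sleep most of the day.")

# Step 1: split the document into sentences (a naive split on
# sentence-ending punctuation; a proper tokenizer could be swapped in).
sentences = [s for s in re.split(r"(?<=[.!?])\s+", document) if s]

# Step 2: build the sentence-by-word count matrix.
# Each row is a sentence, each column a word, each value a word count.
vectorizer = CountVectorizer(stop_words="english")
count_matrix = vectorizer.fit_transform(sentences)

# Step 3: cosine similarity between every pair of rows gives the
# sentence-similarity matrix (normalized dot products of the rows).
similarity_matrix = cosine_similarity(count_matrix)
print(similarity_matrix.round(2))

The first two sentences share vocabulary, so their off-diagonal similarity is high, while the third scores zero against both.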
In the IMDB model we first find each word's frequency and assign it an integer rank by how frequent it is, so rank 1 is the most frequent word. Each review is then converted into a matrix of shape 1 x n, where n is the number of words in the review and each value is that word's frequency rank. Later, for the embedding step, we build a one-hot (multi-hot) matrix: the total number of columns equals the total number of words overall, and the value is 1 at the column positions equal to the frequency ranks of the words present in the review (a minimal encoding sketch follows the links below).
Related post : https://wordpress.com/post/datasciencebasicsblog.wordpress.com/1034
Text summarization approaches : https://datasciencebasicsblog.wordpress.com/2018/06/02/text-summarization-approaches/
Reference for implementation : https://github.com/xiaoxu193/PyTeaser
Code implementation : LDA_Topic_Modeling.ipynb
Topic modeling with Python : https://datasciencebasicsblog.wordpress.com/topic-modeling-with-python/
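A minimal sketch of the frequency-rank encoding and the multi-hot ("one-hot") vectorization described above, assuming the Keras IMDB dataset and NumPy (the original model's code may differ):

import numpy as np
from tensorflow.keras.datasets import imdb

NUM_WORDS = 10000  # keep only the 10,000 most frequent words

# Each review arrives as a list of integer ranks: word 1 is the most
# frequent word in the corpus, word 2 the second most frequent, and so on.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=NUM_WORDS)
print(x_train[0][:10])  # e.g. [1, 14, 22, 16, ...]

def multi_hot(sequences, dimension=NUM_WORDS):
    # One row per review, one column per word rank; put a 1 at every
    # column index that appears in the review's rank sequence.
    results = np.zeros((len(sequences), dimension))
    for i, seq in enumerate(sequences):
        results[i, seq] = 1.0
    return results

x_train_vec = multi_hot(x_train)
print(x_train_vec.shape)  # (25000, 10000)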
To learn about language models, RNNs, and their implementation steps, go to:
https://datasciencebasicsblog.wordpress.com/2018/03/03/nlp-recurrent-neural-networks-and-language-models/
https://datasciencebasicsblog.wordpress.com/2018/08/20/making-a-language-model-using-python/
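As a rough illustration of what those posts cover, here is a toy word-level RNN language model in Keras that learns to predict the next word from a prefix; the corpus, layer sizes, and training settings are assumptions for demonstration, not the posts' implementation:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

corpus = ["the cat sat on the mat", "the dog sat on the rug"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0

# Build (prefix -> next word) training pairs from every sentence.
sequences = []
for line in corpus:
    ids = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(ids)):
        sequences.append(ids[:i + 1])
max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len)
X, y = padded[:, :-1], padded[:, -1]

# Embed each word id, run the prefix through an RNN, and predict a
# probability distribution over the vocabulary for the next word.
model = Sequential([
    Embedding(vocab_size, 16),
    SimpleRNN(32),
    Dense(vocab_size, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=200, verbose=0)

# Predict the most likely word after a prefix.
prefix = pad_sequences(tokenizer.texts_to_sequences(["the cat sat"]), maxlen=max_len - 1)
next_id = int(np.argmax(model.predict(prefix, verbose=0)))
print(tokenizer.index_word[next_id])  # likely "on"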