A project to implement extractive text summarization using OCR and attention networks. In this project, we propose a model that performs extractive summarization of a news article with the aid of Optical Character Recognition (OCR) and attention networks. The model is built on Recurrent Neural Networks with bi-directional Long Short-Term Memory (LSTM) layers, and uses Bahdanau attention as the attention mechanism. Its main objective is to summarize the text from an image of an article and display the result.
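To make the attention step concrete, here is a minimal NumPy sketch of additive (Bahdanau-style) attention. It is an illustration only and not the code from the notebook: the function name `bahdanau_attention`, the weight matrices `W_q`, `W_k`, the vector `v`, and all dimensions are assumptions chosen for the example.

```python
import numpy as np


def bahdanau_attention(query, keys, W_q, W_k, v):
    """Additive attention over a sequence of encoder states.

    query: (d_q,)    current decoder state
    keys:  (T, d_k)  encoder states, one per time step
    W_q:   (d_a, d_q), W_k: (d_a, d_k), v: (d_a,)  learned parameters
    Returns the context vector (d_k,) and the attention weights (T,).
    """
    # score_t = v . tanh(W_q q + W_k h_t), computed for all T steps at once
    scores = np.tanh(query @ W_q.T + keys @ W_k.T) @ v
    # Numerically stable softmax over the time axis
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Context vector: attention-weighted sum of the encoder states
    context = weights @ keys
    return context, weights


# Hypothetical sizes: 5 encoder steps, 8-dim states, 4-dim attention space
rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))
query = rng.normal(size=(8,))
W_q = rng.normal(size=(4, 8))
W_k = rng.normal(size=(4, 8))
v = rng.normal(size=(4,))

context, weights = bahdanau_attention(query, keys, W_q, W_k, v)
print(weights.shape, context.shape)
```

In a trained model the parameters are learned jointly with the recurrent layers; here they are random, so only the shapes and the normalization of the weights are meaningful.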
Basic software requirements:
- Python 3
- Anaconda for Python 3
- TensorFlow
- Create a TensorFlow environment
conda create -n tensorflow_env tensorflow
conda activate tensorflow_env
- Install pytesseract
- Install the following:
- Git
- Node.js
- npm
- Bower
- Only for first-time installation:
git clone https://github.com/mitali3112/Text-Summarizer.git
- Enter the server folder and run all the cells of the notebook titled "Text Summarization.ipynb", following these steps:
  - Download the dataset from Kaggle using the link given below.
  - Replace the paths in the notebook with the relevant paths on your system.
  - Set the save paths for the trained model, embeddings and TensorBoard logs to locations on your system.
  - Train the model and finish executing all the cells.
  - Save the word vectors in pickle format in the server folder.
  - Change the path for the trained model in the file testprocess.py.
- Enter the app/ folder and run:
bower install
npm install
node run_app.js
Setup is now done.
- Keep relevant images or text ready for running the file.
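The setup step that saves the word vectors in pickle format could look roughly like the sketch below. The helper names, the file name `word_vectors.pkl`, and the dictionary layout (word mapped to a list of floats) are assumptions made for illustration; the notebook may store the vectors differently.

```python
import pickle


def save_word_vectors(vectors, path):
    """Serialize a word -> vector mapping to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(vectors, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_word_vectors(path):
    """Load a word -> vector mapping previously saved with save_word_vectors."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Hypothetical toy vectors; real embeddings come from the trained model
    toy_vectors = {"news": [0.1, 0.2, 0.3], "article": [0.4, 0.5, 0.6]}
    save_word_vectors(toy_vectors, "word_vectors.pkl")
    print(load_word_vectors("word_vectors.pkl") == toy_vectors)
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.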
In different terminal tabs (all activated under the TensorFlow environment created above):
- Go to app/
node run_app.js
- Go to server/
python3 server.py
- Now go to http://localhost:8000/ to open the frontend.
- To launch TensorBoard:
tensorboard --logdir='full path to tensorboard savepath'
Download the All the News dataset from Kaggle.
The Times of India dataset is present in the repository as toi.csv.
- Recurrent Neural Networks
- Long Short Term Memory
- Bahdanau Attention
- Adam Optimizer
- Tesseract OCR
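As an illustration of the OCR stage, the sketch below reads text from an article image with pytesseract and tidies it before summarization. It is a hedged example, not the repository's code: the function names `clean_ocr_text` and `image_to_text` are hypothetical, and the hyphen-joining rule is a simple heuristic that can also merge genuinely hyphenated words split across lines.

```python
import re


def clean_ocr_text(raw):
    """Normalize raw OCR output into a single line of text."""
    # Re-join words that were hyphenated across a line break (heuristic)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse newlines and repeated spaces into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def image_to_text(image_path):
    """Run Tesseract on an image file and return cleaned text.

    Requires the Tesseract binary plus the pytesseract and Pillow packages.
    They are imported here so clean_ocr_text works without them installed.
    """
    from PIL import Image
    import pytesseract

    raw = pytesseract.image_to_string(Image.open(image_path))
    return clean_ocr_text(raw)
```

The cleaned string returned by `image_to_text` is what would then be passed on to the summarization model.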
- Mitali Sheth