Deep Learning Image Captioning Model on the Flickr8K dataset using TensorFlow and Keras in Python.
- Introduction
- Flickr8K Dataset
- Image Feature Extraction
- Caption Data Analysis
- Prepare Captions
- Model Architecture
- Loss Function
- Training the model
- Generating the captions
- BLEU Score
- Testing the model
- Results
- Future Work
- References
A data generator function is defined to supply the data in batches instead of loading it all at once, which avoids session crashes from running out of memory. The data is split into train and test sets, and the model is trained on the train split. The loss decreases gradually over the iterations; the number of epochs and the batch size are chosen accordingly for better results.
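A minimal sketch of such a generator, assuming `captions` maps each image id to its list of caption strings, `features` maps each image id to its pre-extracted CNN feature vector, and `tokenizer`, `max_length`, and `vocab_size` come from the caption-preparation step (all names here are illustrative, not necessarily the repository's exact ones):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(image_ids, captions, features, tokenizer,
                   max_length, vocab_size, batch_size):
    # Yields ((image_features, partial_sequences), next_words) batches indefinitely,
    # so the full training set never has to sit in memory at once.
    X1, X2, y = [], [], []
    n = 0
    while True:
        for image_id in image_ids:
            n += 1
            for caption in captions[image_id]:
                seq = tokenizer.texts_to_sequences([caption])[0]
                # turn one caption into several (prefix -> next word) training pairs
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
                    out_seq = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1.append(features[image_id])
                    X2.append(in_seq)
                    y.append(out_seq)
            if n == batch_size:
                yield (np.array(X1), np.array(X2)), np.array(y)
                X1, X2, y = [], [], []
                n = 0

# Possible usage: steps per epoch follows from the train split size and batch size.
# generator = data_generator(train_ids, captions, features, tokenizer,
#                            max_length, vocab_size, batch_size=32)
# model.fit(generator, epochs=20, steps_per_epoch=len(train_ids) // 32)
```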
Captions are generated for an image by converting the indices predicted by the model back into words. The caption starts with 'startseq', and the model keeps predicting the next word until 'endseq' appears or the maximum caption length is reached.

BLEU (Bilingual Evaluation Understudy) is a well-established metric for measuring the similarity of a hypothesis sentence to one or more reference sentences. Given a single hypothesis sentence and multiple reference sentences, it returns a value between 0 and 1; a score close to 1 means the hypothesis is very similar to the references. The Python Natural Language Toolkit library (NLTK) provides an implementation of the BLEU score that can be used to evaluate the generated captions against their references.
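A rough sketch of that decoding loop, assuming a trained two-input `model`, the fitted `tokenizer`, `max_length`, and a single image's feature vector `image_feature` of shape `(1, feature_dim)` (names are illustrative):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def idx_to_word(index, tokenizer):
    # reverse lookup from the predicted integer index to a vocabulary word
    for word, idx in tokenizer.word_index.items():
        if idx == index:
            return word
    return None

def predict_caption(model, image_feature, tokenizer, max_length):
    in_text = 'startseq'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([image_feature, seq], verbose=0)
        word = idx_to_word(np.argmax(yhat), tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'endseq':    # stop once the end token is generated
            break
    return in_text
```

NLTK's `corpus_bleu` can then score the predicted captions against the tokenized reference captions of the test split; for example, with made-up captions:

```python
from nltk.translate.bleu_score import corpus_bleu

# actual: one list of tokenized reference captions per test image
# predicted: one tokenized predicted caption per test image
actual = [[c.split() for c in ['startseq dog runs across the grass endseq',
                               'startseq a dog is running on grass endseq']]]
predicted = ['startseq dog runs on the grass endseq'.split()]

print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
```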
The model is tested on the test data, and the BLEU score is evaluated to study its performance by comparing the predicted caption against the actual captions, with both represented as lists of tokens. Finally, the results are visualised for 6 test images, showing the actual captions, the predicted caption, the BLEU score, and a comment on whether the predicted caption is Bad, Not Bad, or Good depending on its BLEU score against the actual captions for that image.

A smaller dataset (Flickr8k) was used due to limited computational power; a potential improvement is to train on a combination of Flickr8k, Flickr30k, and MSCOCO. The pre-trained CNN was used directly in the pipeline without fine-tuning, so the network does not adapt to this specific training dataset. By experimenting with different pre-trained CNNs and enabling fine-tuning, a slightly higher BLEU-4 score can be expected. Video captioning, which generates a text description of video content, is another direction: compared with image captioning, the scene changes considerably and contains more information than a static image, so generating the description requires extracting more features. This could be the next step for this repository.

[1] Xu, Kelvin, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." arXiv preprint arXiv:1502.03044 (2015).
[2] Qichen Fu, Yige Liu, Zijian Xie. University of Michigan, Ann Arbor. EECS 442 Final Project Report (eecs442-report.pdf).
[3] VGG16: https://neurohive.io/en/popular-networks/vgg16/
[4] An Introduction to Neural Network Loss Functions: https://programmathically.com/an-introduction-to-neural-network-loss-functions/
[5] BLEU Score: https://en.wikipedia.org/wiki/BLEU
[6] Flickr8k Dataset: https://www.kaggle.com/datasets/adityajn105/flickr8k
[7] Convolutional Neural Networks: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
[8] LSTM: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
[9] VGG-16 CNN model: https://www.geeksforgeeks.org/vgg-16-cnn-model/
[10] Image captioning with visual attention: https://www.tensorflow.org/tutorials/text/image_captioning