The popularity of text-to-image models has spawned an entirely new field of prompt engineering. Part art and part unsettled science, it has ML practitioners and researchers rapidly working to understand the relationships between prompts and the images they generate. In this project, we aim to reverse the typical direction of a generative text-to-image model: instead of generating an image from a text prompt, can we build a model that predicts the text prompt given a generated image? Predictions are made on a dataset containing a wide variety of (prompt, image) pairs generated by Stable Diffusion 2.0, to understand how reversible this latent relationship is.
I have used an encoder-decoder model to build the image caption generator, with a CNN as the encoder and an LSTM network as the decoder, and trained it on the Flickr8k dataset.
For feature extraction, the model uses ResNet50 pre-trained on the ImageNet dataset (which requires an image input of size 224×224×3). The image features are taken from the layer just before the final classification layer, and an additional dense layer is added to convert them into a vector of length 2048.
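Below is a minimal sketch of this feature-extraction step, assuming a TensorFlow/Keras setup; the `extract_features` helper and the `pooling="avg"` choice are illustrative assumptions, not the exact training code.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

# ResNet50 pre-trained on ImageNet; include_top=False with pooling="avg"
# exposes the pooled activations just before the classification layer.
base = ResNet50(weights="imagenet", include_top=False, pooling="avg")

# Additional dense layer projecting the pooled features to a length-2048 vector.
features = Dense(2048, activation="relu")(base.output)
encoder = Model(inputs=base.input, outputs=features)

def extract_features(image_path: str) -> np.ndarray:
    """Load a 224x224x3 image and return its 2048-d feature vector."""
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img)
    x = preprocess_input(x[np.newaxis, ...])   # shape (1, 224, 224, 3)
    return encoder.predict(x, verbose=0)[0]    # shape (2048,)
```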
I have used an LSTM network as the decoder.
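One common way to wire such a decoder is the merge-style captioning architecture found in many TensorFlow tutorials, sketched below; the vocabulary size, maximum caption length, and 256-unit hidden size are placeholder assumptions rather than values taken from this project.

```python
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000   # placeholder: size of the Flickr8k caption vocabulary
MAX_LEN = 34        # placeholder: maximum caption length in tokens

# Image branch: the 2048-d encoder features, compressed to the LSTM size.
img_in = Input(shape=(2048,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: the partial caption, embedded and run through an LSTM.
seq_in = Input(shape=(MAX_LEN,))
seq_emb = Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_in)
seq_vec = LSTM(256)(Dropout(0.5)(seq_emb))

# Merge both branches and predict the next word in the caption.
merged = Dense(256, activation="relu")(add([img_vec, seq_vec]))
out = Dense(VOCAB_SIZE, activation="softmax")(merged)

decoder = Model(inputs=[img_in, seq_in], outputs=out)
decoder.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time, captions are generated word by word: the model is fed the image features plus the tokens produced so far, and the next word is sampled (or taken greedily) until an end token or the maximum length is reached.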
- Show and Tell: A Neural Image Caption Generator (https://arxiv.org/abs/1411.4555)
- TensorFlow Tutorials
This project was done under the guidance of the Vision And Language Group @ IIT Roorkee.