
To complete this task, the neural network architecture combines a CNN and an RNN. The architecture works as follows. The CNN first converts an image into a feature vector of fixed length, and a sequence generator such as an RNN or LSTM then uses that vector to generate a series of words, that is, a caption for the image. The encoder employed in this project is ResNet50, a model pretrained on the ImageNet dataset, whose more than one million images are divided into a thousand categories. Since its weights are already tuned to identify many things that commonly occur in nature, we can use this network effectively by removing the top layer of 1000 neurons (meant for ImageNet classification) and adding in its place a linear layer with the same number of neurons as the output of the LSTM.

The RNN consists of a series of LSTM (Long Short-Term Memory) cells, which are used to recursively generate a caption for an input image. These cells use recurrence and gates to retain information from past time steps. Finally, the outputs of the encoder and the decoder are merged and passed to a dense layer and then to an output layer, which predicts the next word given the image and the current sequence.
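To make the role of recurrence and gates concrete, the following is a minimal sketch of a single LSTM time step in plain Python. It follows the standard textbook formulation and is not necessarily identical to the layer used in this project. The function name `lstm_step`, the weight dictionaries `W` and `b`, and the gate keys `"f"`, `"i"`, `"o"`, `"g"` are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Logistic function, used by the gates to produce values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """Run one LSTM time step on plain Python lists (illustrative only).

    x      : input vector at this time step
    h_prev : hidden state from the previous time step
    c_prev : cell state (the "memory") from the previous time step
    W, b   : hypothetical dicts mapping a gate key ("f", "i", "o", "g")
             to a weight matrix (list of rows) and a bias vector
    """
    z = x + h_prev  # concatenate the input and the previous hidden state

    def affine(gate):
        return [
            sum(w * v for w, v in zip(row, z)) + bias
            for row, bias in zip(W[gate], b[gate])
        ]

    f = [sigmoid(a) for a in affine("f")]    # forget gate: how much old memory to keep
    i = [sigmoid(a) for a in affine("i")]    # input gate: how much new information to write
    o = [sigmoid(a) for a in affine("o")]    # output gate: how much memory to expose
    g = [math.tanh(a) for a in affine("g")]  # candidate values for the memory

    # New cell state: part of the old memory plus part of the candidate values.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # New hidden state: a gated view of the updated memory.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

Calling `lstm_step` once per word, and feeding the returned `h` and `c` back in as `h_prev` and `c_prev`, is what lets information from earlier time steps influence later predictions.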
The proposed system is as follows:
• The first component is the graphical user interface (GUI), where the user engages with the system.
• A first-time visitor must log in or register.
• After completing this, the user will have the choice to upload an image and receive a description of it.
• After the user submits the image or a link to it, we use the CNN to extract features from the image and turn them into a feature vector with a fixed length.
• Following the extraction, we pre-process the images by modifying their size, orientation, colour, brightness, and perspective. During this step we also remove noise, such as punctuation, from the captions.
• The feature vector is then fed to the RNN, which recursively generates a caption for the provided image.
• As the last step, the description produced by the model is shown to the user.
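The caption-cleaning and caption-generation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: `clean_caption`, `generate_caption`, the `<start>` and `<end>` tokens, and the stand-in predictor `toy_predict_next` are all hypothetical names, and greedy word-by-word decoding is assumed as the generation strategy.

```python
import string


def clean_caption(caption):
    """Lowercase a caption, strip punctuation, and collapse extra whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(caption.lower().translate(table).split())


def generate_caption(features, predict_next, max_len=20,
                     start="<start>", end="<end>"):
    """Greedy decoding: ask the model for one word at a time.

    features     : fixed-length feature vector produced by the CNN encoder
    predict_next : callable (features, words_so_far) -> next word; in a real
                   system this would wrap the trained encoder-decoder model
    """
    words = [start]
    for _ in range(max_len):
        word = predict_next(features, words)
        if word == end:  # the model signals that the caption is complete
            break
        words.append(word)
    return " ".join(words[1:])  # drop the start token


def toy_predict_next(features, words):
    """Hypothetical stand-in for a trained model: replays a fixed caption."""
    script = ["a", "dog", "runs", "<end>"]
    return script[min(len(words) - 1, len(script) - 1)]


if __name__ == "__main__":
    print(clean_caption("A dog, running!  Fast."))
    print(generate_caption(None, toy_predict_next))
```

The loop mirrors the recursive generation described above: each predicted word is appended to the current sequence, which is then passed back to the model together with the image features until an end token or the length limit is reached.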

