Ask any question about the given image and let the model answer it
Project Description
This project is a Visual Question Answering (VQA) web application: users upload an image and ask natural-language questions about it. A pre-trained Vision-and-Language Transformer (ViLT) model processes the image and the question together and returns an answer. A Flask backend serves the model, and a responsive web frontend handles image upload and question entry.
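As a rough illustration of the model step described above, the sketch below runs ViLT question answering through Hugging Face Transformers. It is not the project's actual code. The checkpoint name `dandelin/vilt-b32-finetuned-vqa` and the helper names `pick_answer` and `answer_question` are assumptions chosen for this example; the project may use a different checkpoint or structure.

```python
def pick_answer(scores, id2label):
    """Return the label with the highest score (illustrative helper)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return id2label[best]


def answer_question(image_path, question,
                    checkpoint="dandelin/vilt-b32-finetuned-vqa"):
    """Answer a question about an image with a ViLT VQA checkpoint.

    The checkpoint name is an assumption; the listing does not name one.
    Heavy imports live inside the function so the helper above can be
    used without torch or transformers installed.
    """
    import torch
    from PIL import Image
    from transformers import ViltForQuestionAnswering, ViltProcessor

    processor = ViltProcessor.from_pretrained(checkpoint)
    model = ViltForQuestionAnswering.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    # The processor tokenizes the question and prepares the image tensor.
    encoding = processor(image, question, return_tensors="pt")

    with torch.no_grad():
        logits = model(**encoding).logits  # shape: (1, number_of_answers)

    # This kind of checkpoint scores a fixed answer vocabulary, so the
    # answer is the top-scoring label rather than free-form text.
    return pick_answer(logits[0].tolist(), model.config.id2label)
```

In a web application, the processor and model would normally be loaded once at startup rather than on every call; they are loaded inside the function here only to keep the sketch short.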
Features
Upload an image and ask questions in natural language
ViLT (Vision-and-Language Transformer) model for multimodal understanding
Real-time answer generation from visual and textual input
Web-based interface for easy user interaction
Backend integration via Flask REST APIs
Suitable for AI, deep learning, and NLP-based academic projects
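One way the upload-and-ask flow behind a Flask REST API could look is sketched below. This is a minimal example under stated assumptions, not the shipped backend: the route `/api/answer`, the form field names `image` and `question`, the extension whitelist, and the names `validate_request`, `create_app`, and `answer_fn` are all hypothetical.

```python
import os

# Assumed whitelist; the listing does not say which formats are accepted.
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def validate_request(filename, question):
    """Return an error message, or None when the request looks usable."""
    if not filename:
        return "No image uploaded."
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return "Unsupported image type."
    if not question or not question.strip():
        return "Question is empty."
    return None


def create_app(answer_fn):
    """Build a Flask app around answer_fn(image_file, question) -> str.

    answer_fn is a hypothetical callable that wraps the VQA model.
    """
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/api/answer", methods=["POST"])
    def answer():
        upload = request.files.get("image")
        question = request.form.get("question", "")
        error = validate_request(upload.filename if upload else "", question)
        if error:
            return jsonify({"error": error}), 400
        # upload.stream is a file-like object the model wrapper can read.
        return jsonify({"answer": answer_fn(upload.stream, question)})

    return app
```

The frontend would then send a multipart form POST with the image and the question and display the `answer` field of the JSON response. Passing the model in as `answer_fn` keeps the route testable without loading the model.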
Tech Stack
Python
Vision-and-Language Transformer (ViLT)
Hugging Face Transformers
Flask (Backend Framework)
HTML, CSS, JavaScript (Frontend)
Other supporting Python libraries
System Requirements
Python 3.10+
Anaconda
No GPU is required
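The requirements above can be checked before setup with a short script such as the following sketch. The function names are illustrative, and the assumption that the model runs on PyTorch (and falls back to CPU when no GPU is present) is ours, not stated in the listing.

```python
import sys


def meets_python_requirement(version_info=sys.version_info, minimum=(3, 10)):
    """Return True when the interpreter is at least Python 3.10."""
    return tuple(version_info[:2]) >= minimum


def select_device():
    """Pick 'cuda' when a GPU is usable, otherwise fall back to 'cpu'.

    Assumes a PyTorch backend; without torch installed this reports 'cpu'.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Python version OK:", meets_python_requirement())
    print("Inference device:", select_device())
```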
Deliverables
Project Code
Setup over video call
WhatsApp support