Visual Question Answering Using Deep Learning

Ask any question about the given image and let the model answer it

Project Description

This project builds a Visual Question Answering (VQA) web application that lets users upload an image and ask natural language questions about it. The system uses a pre-trained Vision-and-Language Transformer (ViLT) model to process the image and the question jointly and predict an answer. The application runs on a Flask backend, with a responsive web frontend for uploading images and submitting questions.
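As a rough illustration of the inference step described above, the sketch below loads a ViLT question-answering model through Hugging Face Transformers and returns the highest-scoring answer. The checkpoint name `dandelin/vilt-b32-finetuned-vqa` and the helper names `top_answer` and `answer_question` are assumptions made for this example; the project does not state which weights or function names it uses.

```python
from typing import Mapping, Sequence

# Assumed checkpoint; the project does not say which ViLT weights it uses.
CHECKPOINT = "dandelin/vilt-b32-finetuned-vqa"


def top_answer(scores: Sequence[float], id2label: Mapping[int, str]) -> str:
    """Return the label with the highest score (this model treats VQA as classification)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return id2label[best]


def answer_question(image_path: str, question: str) -> str:
    """Hypothetical helper: run one image/question pair through ViLT on CPU."""
    # Heavy dependencies are imported lazily so top_answer stays usable without them.
    import torch
    from PIL import Image
    from transformers import ViltForQuestionAnswering, ViltProcessor

    processor = ViltProcessor.from_pretrained(CHECKPOINT)
    model = ViltForQuestionAnswering.from_pretrained(CHECKPOINT)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, question, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, number of candidate answers)
    return top_answer(logits[0].tolist(), model.config.id2label)
```

In a real service the processor and model would normally be loaded once at startup rather than on every call; they are loaded inside the function here only to keep the sketch short.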

Features

Upload an image and ask questions in natural language
Pre-trained ViLT model for multimodal understanding
Real-time answer generation from visual and textual input
Web-based interface for easy user interaction
Backend integration via Flask REST APIs
Suitable for AI, deep learning, and NLP-based academic projects
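The upload-and-ask flow behind the Flask REST API could look roughly like the sketch below. The route `/api/answer`, the form field names `image` and `question`, the allowed file extensions, and the `answer_fn` callback are all hypothetical choices for illustration, not details taken from the project.

```python
from typing import Callable, Optional

ALLOWED_EXTENSIONS = {"png", "jpg", "jpeg"}  # assumed; not specified by the project


def validate_request(filename: Optional[str], question: Optional[str]) -> Optional[str]:
    """Return an error message, or None if the upload looks usable."""
    if not filename or "." not in filename:
        return "image file is required"
    if filename.rsplit(".", 1)[1].lower() not in ALLOWED_EXTENSIONS:
        return "unsupported image type"
    if not question or not question.strip():
        return "question is required"
    return None


def create_app(answer_fn: Callable[[object, str], str]):
    """Build a Flask app; answer_fn(image_file, question) is supplied by the caller."""
    # Imported here so validate_request can be used without Flask installed.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.post("/api/answer")  # hypothetical route name
    def answer():
        image = request.files.get("image")
        question = request.form.get("question")
        error = validate_request(image.filename if image else None, question)
        if error:
            return jsonify(error=error), 400
        return jsonify(answer=answer_fn(image.stream, question.strip()))

    return app
```

Passing the model call in as `answer_fn` keeps the web layer separate from the model, so the endpoint can be exercised with a stub function before any weights are downloaded.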

Tech Stack

Python
Vision-and-Language Transformer (ViLT)
Hugging Face Transformers
Flask (Backend Framework)
HTML, CSS, JavaScript (Frontend)
Other supporting Python libraries

System Requirements

Python 3.10+
Anaconda
No GPU is required

Deliverables

Project Code
Setup over video call
WhatsApp support

