Previous efforts in visual question answering and text-image matching also faced this limitation, requiring specialized ...