Vision-Language

2021-12-27

Multimodal Bitransformer: a VQA model that needs no cross-modal pre-training

Supervised Multimodal Bitransformers for Classifying Images and Text https://arxiv.org/abs/1909.02950 2019 Architecture For VQA, by combining separately pre-trained image and text encoders and applying self-attention (SA) over them in a BERT-based model, ViLBERT-like…

2021-12-24

Florence: a foundation model applicable to diverse downstream tasks in vision

DeepLearning Pre-Training Vision-Language Transformer

Florence: A New Foundation Model for Computer Vision 2021/11/22 https://arxiv.org/abs/2111.11432 Fig.2 Overview of building Florence In the image domain, a wide range of downstream tasks (classification, retrieval, object detection, VQA, image captioning, video retrieval, action…