
Commit 70f35ab

Merge pull request #265 from Weili-NLP/master
open release code for UNIMO-2
2 parents 811f62c + f3932a9 commit 70f35ab

78 files changed (+113457, −0 lines)


NLP/UNIMO-2/CHANGELOG.md

Lines changed: 24 additions & 0 deletions
Changelog
===

This file records all notable changes to the project; its format is based on [Keep a Changelog].

Project versions follow [Semantic Versioning][PEP-440].

[Unreleased]
---
### Added
- New additions are recorded here.

### Changed
- Changes are recorded here.

0.1.0 - 2022-05-05
---
### Added
- Project created.

[Unreleased]: http://icode.baidu.com/repos/baidu/personal-code/UNIMO2-Open/merge/0.1.0...master

[Keep a Changelog]: https://keepachangelog.com/zh-CN/1.0.0/
[Semantic Versioning]: https://semver.org/lang/zh-CN/
[PEP-440]: https://www.python.org/dev/peps/pep-0440/

NLP/UNIMO-2/README-md-bak

Lines changed: 216 additions & 0 deletions
UNIMO
====
Code for the findings of the ACL 2022 long paper [UNIMO-2: End-to-End Unified Vision-Language Grounded Learning](https://arxiv.org/pdf/2203.09067.pdf).


Abstract
---

Vision-Language Pre-training (VLP) has achieved impressive performance on various cross-modal downstream tasks. However, most existing methods can only learn from aligned image-caption data and rely heavily on expensive regional features, which greatly limits their scalability and performance. In this paper, we propose an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both aligned image-caption data and unaligned image-only and text-only corpora. We build a unified Transformer model to jointly learn visual representations, textual representations, and the semantic alignment between images and texts. In particular, we propose to conduct grounded learning on both images and texts via a shared grounded space, which helps bridge unaligned images and texts and aligns the visual and textual semantic spaces across different types of corpora. The experiments show that our grounded learning method improves textual and visual semantic alignment, which in turn improves performance on various cross-modal tasks. Moreover, benefiting from effective joint modeling of different types of corpora, our model also achieves impressive performance on single-modal visual and textual tasks. Our code and models are public at the UNIMO project page: https://unimo-ptm.github.io.

![UNIMO-2](images/paper.png#pic_center)


Dependencies
---
python3.7.4\
cuda-10.1\
cudnn_v7.6\
nccl2.4.2\
java1.8\
paddlepaddle-gpu==2.1.2\
pyrouge==0.1.3
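
For reference, a minimal environment setup might look like the following sketch. It assumes a machine with the CUDA 10.1 / cuDNN 7.6 stack already installed and uses conda (an assumption, not a requirement of this repo) to pin the Python version; the package versions are the ones listed above:

```
# Create an isolated environment with the Python version listed above
# (conda is an assumption here; any Python 3.7.4 environment should do).
conda create -n unimo2 python=3.7.4 -y
conda activate unimo2

# Install the pinned Python dependencies; the GPU build of PaddlePaddle
# expects the CUDA 10.1 / cuDNN 7.6 stack from the list above.
pip install paddlepaddle-gpu==2.1.2 pyrouge==0.1.3
```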
35+
36+
37+
Pre-trained Models
38+
---
39+
Similar to UNIMO, UNIMO-2 adopts large-scale text corpus, image collections and image-text aligned datasets as the pre-training data.
40+
We provide pre-trained UNIMO-2 models:
41+
42+
```
43+
cd /path/to/model_files
44+
wget --no-check-certificate -q https://unimo-2.bj.bcebos.com/model/UNIMO-2.tar.gz
45+
tar -zxf UNIMO-2.tar.gz
46+
```
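
The checkpoint archive is large, so an unstable connection may truncate it. As a sketch, `wget -c` resumes a partial download, and listing the archive before extraction is a cheap sanity check:

```
cd /path/to/model_files
# -c resumes a partially downloaded archive instead of restarting it
wget --no-check-certificate -c https://unimo-2.bj.bcebos.com/model/UNIMO-2.tar.gz
# list the first few entries to verify the archive is readable before extracting
tar -tzf UNIMO-2.tar.gz | head
```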


Experiments
---

Our fine-tuning experiments were carried out on V100 GPUs. The results below are from the UNIMO-2 model.


1 Cross-Modal Tasks
---


### (1) Image-Text Retrieval

#### Download the Flickr30k dataset:

```
cd /path/to/data
wget --no-check-certificate -q https://unimo-2.bj.bcebos.com/data/Flickr30k.tar.gz
tar -zxf Flickr30k.tar.gz
```
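
All dataset archives in this README are fetched the same way, so as a convenience sketch the downloads can be looped (the archive names are the ones given in the sections of this document; trim the list to the tasks you need):

```
cd /path/to/data
# archive names taken from the download steps in this README
for f in Flickr30k.tar.gz coco.tar.gz MNLI-AX.tar.gz; do
    wget --no-check-certificate -q "https://unimo-2.bj.bcebos.com/data/$f"
    tar -zxf "$f"
done
```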

#### Run the following command to train and evaluate on the Flickr30k dataset:

```
bash ./script/retrieval-grounded/Flickr30k-fleet/run.sh
```
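
Fine-tuning runs for a while, so it can be convenient to detach the job and keep a log. This is a generic shell pattern, not something `run.sh` requires:

```
# run in the background and capture stdout/stderr for later inspection
nohup bash ./script/retrieval-grounded/Flickr30k-fleet/run.sh > flickr30k.log 2>&1 &
tail -f flickr30k.log
```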

#### Evaluation Results:

Results of the Image Retrieval task on the Flickr30k dataset:

| Model | R@1 | R@5 | R@10 |
| ----------- | ------- | ------- | ------- |
| UNIMO-2 (zero-shot) | 72.70 | 91.18 | 94.60 |
| UNIMO-2 (finetuned) | 80.14 | 95.58 | 97.75 |

Results of the Text Retrieval task on the Flickr30k dataset:

| Model | R@1 | R@5 | R@10 |
| ----------- | ------- | ------- | ------- |
| UNIMO-2 (zero-shot) | 88.46 | 96.84 | 98.92 |
| UNIMO-2 (finetuned) | 92.01 | 99.31 | 99.51 |



### (2) Image Caption Generation

#### Download the COCO Caption dataset:

```
cd /path/to/data
wget --no-check-certificate -q https://unimo-2.bj.bcebos.com/data/coco.tar.gz
tar -zxf coco.tar.gz
```

#### Download the evaluation script:

```
mkdir -p src/eval/tasks
cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/coco.tar.gz
tar -zxf coco.tar.gz
```
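
The standard COCO caption metrics rely on a Java runtime, which presumably is why `java1.8` appears in the dependency list; assuming this evaluation script does the same, it is worth confirming a JVM is visible before training:

```
# the dependency list pins java1.8; confirm a JVM is on PATH
java -version
```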

#### Run the following command to train and evaluate on the COCO Caption dataset:

```
bash ./script/img2txt-grounded/coco-oscar/run.sh
```


#### Evaluation Results:

| Model | BLEU-4 | CIDEr |
| ----------- | ------- | ------- |
| UNIMO-2 | 39.7 | 131.2 |



### (3) Visual Entailment
#### TODO



### (4) Visual Question Answering (VQA)
#### TODO





2 Visual Tasks
---

### (1) Image Classification
#### TODO

### (2) Zero-shot Image Classification
#### TODO



3 Textual Tasks
---

### (1) Natural Language Inference

#### Download the MNLI-AX dataset:
```
cd /path/to/data
wget --no-check-certificate -q https://unimo-2.bj.bcebos.com/data/MNLI-AX.tar.gz
tar -zxf MNLI-AX.tar.gz
```

#### Run the following command to train and evaluate on the MNLI-AX dataset:

```
bash ./script/classification/MNLI-AX/run.sh
```


#### Evaluation Results:

| Model | Acc (m/mm) |
| ----------- | ------- |
| UNIMO-2 | 87.5/87.5 |




### (2) Sentiment Classification
#### TODO




### (3) Similarity Tasks
#### TODO




### (4) Linguistic Acceptability Judgments
#### TODO




Citation
---
If you find our paper and code useful, please cite the following paper:
```
@article{li2022unimo,
  title={UNIMO-2: End-to-End Unified Vision-Language Grounded Learning},
  author={Li, Wei and Gao, Can and Niu, Guocheng and Xiao, Xinyan and Liu, Hao and Liu, Jiachen and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2203.09067},
  year={2022}
}
```

Contact information
---

For help or issues using UNIMO-2, please submit a GitHub issue.

For personal communication related to UNIMO, please contact Wei Li (liwei85@baidu.com), Can Gao (gaocan01@baidu.com), or Guocheng Niu (niuguocheng@baidu.com).
