CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. CoQA is pronounced as "coca".
CoQA contains 127,000+ questions with answers collected from 8,000+ conversations. Each conversation is collected by pairing two crowdworkers to chat about a passage in the form of questions and answers. The unique features of CoQA include: 1) the questions are conversational; 2) the answers can be free-form text; 3) each answer also comes with an evidence subsequence highlighted in the passage; and 4) the passages are collected from seven diverse domains. CoQA exhibits many challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning. See the CoQA paper for details.
Browse the examples in CoQA.
Download a copy of the dataset in JSON format.
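The dataset is plain JSON, so it is easy to inspect before building a model. The snippet below is a minimal sketch assuming the v1.0 schema of the released files (a top-level `data` list of conversations, each with a `story` passage and parallel `questions`/`answers` lists); the file name is an assumption, so adjust it to wherever you saved your copy.

```python
import json

# Minimal sketch of inspecting the CoQA JSON. Assumes the v1.0 schema:
# top-level "data" is a list of conversations, each with a passage ("story"),
# parallel "questions"/"answers" lists, and per-answer evidence spans.
with open("coqa-dev-v1.0.json") as f:   # adjust the path to your copy
    dataset = json.load(f)

conv = dataset["data"][0]
print(conv["story"][:200], "...")       # the passage the workers chatted about
for q, a in zip(conv["questions"], conv["answers"]):
    print(f"Q{q['turn_id']}: {q['input_text']}")
    print(f"A{a['turn_id']}: {a['input_text']} (evidence: {a['span_text']!r})")
```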
To evaluate your models, use the official evaluation script. To run the evaluation, use `python evaluate-v1.0.py --data-file <path_to_dev-v1.0.json> --pred-file <path_to_predictions>`.
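The script compares a predictions file against the gold answers in the dev file. As a rough sketch (check the evaluation script itself for the authoritative format), predictions are a JSON list with one entry per question turn, keyed by the conversation `id` and the question `turn_id`; `my_model_answer` below is a hypothetical stand-in for your system:

```python
import json

def my_model_answer(story, question):
    # Hypothetical stand-in: replace with your model's prediction.
    return "unknown"

with open("coqa-dev-v1.0.json") as f:   # adjust the path to your copy
    dataset = json.load(f)

# One prediction per question turn; the (id, turn_id, answer) keys are
# assumed to mirror what the evaluation script matches against gold answers.
predictions = []
for conv in dataset["data"]:
    for q in conv["questions"]:
        predictions.append({
            "id": conv["id"],
            "turn_id": q["turn_id"],
            "answer": my_model_answer(conv["story"], q["input_text"]),
        })

with open("predictions.json", "w") as f:
    json.dump(predictions, f)
```

You can then score this file with the command above, e.g. `python evaluate-v1.0.py --data-file coqa-dev-v1.0.json --pred-file predictions.json`.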
Once you are satisfied with your model's performance on the dev set, you can submit it to get official scores on the test sets. We have two test sets: an in-domain set, which contains the same domains as the training and dev sets, and an out-of-domain set, which contains unseen domains (see the paper for more details). To preserve the integrity of the test results, we do not release the test sets to the public. Follow this tutorial on how to submit your model for an official evaluation:
Submission Tutorial

CoQA contains passages from seven domains. We make five of these public under their respective licenses.
Ask us questions at our Google group or at siva.reddy@mila.quebec or danqic@cs.princeton.edu.
We thank the SQuAD team for allowing us to use their code and templates for generating this website.
| Rank | Date | Model | Institution | In-domain F1 | Out-of-domain F1 | Overall F1 |
|---|---|---|---|---|---|---|
| | | Human Performance | Stanford University (Reddy & Chen et al., TACL '19) | 89.4 | 87.4 | 88.8 |
| 1 | Sep 05, 2019 | RoBERTa + AT + KD (ensemble) | Zhuiyi Technology, https://arxiv.org/abs/1909.10772 | 91.4 | 89.2 | 90.7 |
| 1 | Apr 22, 2020 | TR-MT (ensemble) | WeChatAI | 91.5 | 88.8 | 90.7 |
| 2 | Sep 05, 2019 | RoBERTa + AT + KD (single model) | Zhuiyi Technology, https://arxiv.org/abs/1909.10772 | 90.9 | 89.2 | 90.4 |
| 3 | Jan 01, 2020 | TR-MT (ensemble) | WeChatAI | 91.1 | 87.9 | 90.2 |
| 4 | Mar 29, 2019 | Google SQuAD 2.0 + MMFT (ensemble) | MSRA + SDRG | 89.9 | 88.0 | 89.4 |
| 5 | Dec 18, 2019 | TR-MT (single model) | WeChatAI | 90.4 | 86.8 | 89.3 |
| 6 | Sep 13, 2019 | XLNet + Augmentation (single model) | Xiaoming, https://github.com/stevezheng23/xlnet_extension_tf | 89.9 | 86.9 | 89.0 |
| 7 | Mar 29, 2019 | Google SQuAD 2.0 + MMFT (single model) | MSRA + SDRG | 88.5 | 86.0 | 87.8 |
| 7 | Mar 29, 2019 | ConvBERT (ensemble) | Joint Laboratory of HIT and iFLYTEK Research | 88.7 | 85.4 | 87.8 |
| 8 | Jan 25, 2019 | BERT + MMFT + ADA (ensemble) | Microsoft Research Asia | 87.5 | 85.3 | 86.8 |
| 8 | Mar 28, 2019 | ConvBERT (single model) | Joint Laboratory of HIT and iFLYTEK Research | 87.7 | 84.6 | 86.8 |
| 9 | Jan 21, 2019 | BERT + MMFT + ADA (single model) | Microsoft Research Asia | 86.4 | 81.9 | 85.0 |
| 10 | Apr 28, 2020 | XLNet + MMFT + ADA (single model) | NEUKG | 85.7 | 81.7 | 84.6 |
| 11 | Aug 26, 2019 | BERT + AttentionFusionNet (single model) | Beijing Kingsoft AI Lab | 85.4 | 77.3 | 83.0 |
| 12 | Jan 03, 2019 | BERT + Answer Verification (single model) | Sogou Search AI Group, https://github.com/sogou/SMRCToolkit | 83.8 | 80.2 | 82.8 |
| 13 | Jan 06, 2019 | BERT with History Augmented Query (single model) | Fudan University NLP Lab | 82.7 | 78.6 | 81.5 |
| 14 | Feb 01, 2019 | BERT Large Finetuned Baseline (single model) | Anonymous | 82.6 | 78.4 | 81.4 |
| 15 | Jan 21, 2019 | BERT Large Augmented (single model) | Microsoft Dynamics 365 AI Research | 82.5 | 77.6 | 81.1 |
| 16 | Dec 13, 2018 | D-AoA + BERT (single model) | Joint Laboratory of HIT and iFLYTEK Research | 81.4 | 77.3 | 80.2 |
| 17 | Aug 01, 2019 | BERT Augmented + AoA (single model) | Netease Games AI Lab | 81.1 | 77.4 | 80.0 |
| 18 | Mar 10, 2019 | CNet (single model) | Anonymous | 80.9 | 77.1 | 79.8 |
| 19 | Nov 29, 2018 | SDNet (ensemble) | Microsoft Speech and Dialogue Research Group, https://github.com/Microsoft/SDNet | 80.7 | 75.9 | 79.3 |
| 20 | Feb 22, 2019 | CQANet (single model) | Nanjing University | 80.2 | 76.5 | 79.1 |
| 21 | May 09, 2019 | CANet (single model) | Northwestern Polytechnical University | 80.1 | 75.7 | 78.9 |
| 22 | Apr 14, 2019 | BERT w/ 2-context (single model) | NTT Media Intelligence Laboratories, https://arxiv.org/pdf/1905.12848 | 79.8 | 75.9 | 78.7 |
| 22 | Jul 14, 2019 | BERT Finetuned Baseline (single model) | | 79.7 | 76.3 | 78.7 |
| 23 | May 06, 2020 | Bert-MultiChannelFlow (single model) | SIAT-NLP | 79.4 | 75.3 | 78.2 |
| 24 | Dec 30, 2018 | BERT-base finetune (single model) | Tsinghua University CoAI Lab | 79.8 | 74.1 | 78.1 |
| 25 | Apr 19, 2019 | Bert-FlowDelta (single model) | National Taiwan University MiuLab, https://arxiv.org/abs/1908.05117 | 79.2 | 74.1 | 77.7 |
| 26 | Feb 28, 2019 | GraphFlow (single model) | RPI and IBM Research, https://arxiv.org/pdf/1908.00059.pdf | 78.4 | 74.5 | 77.3 |
| 27 | Nov 26, 2018 | SDNet (single model) | Microsoft Speech and Dialogue Research Group, https://github.com/Microsoft/SDNet | 78.0 | 73.1 | 76.6 |
| 28 | Aug 29, 2019 | Flow Framework (single model) | SIAT NLP Group | 77.0 | 73.1 | 75.8 |
| 29 | Oct 06, 2018 | FlowQA (single model) | Allen Institute for Artificial Intelligence, https://arxiv.org/abs/1810.06683 | 76.3 | 71.8 | 75.0 |
| 30 | Jul 17, 2019 | HisFurC + BERT (single model) | | 76.0 | 70.4 | 74.4 |
| 31 | Jan 14, 2019 | RNet + PGNet + BERT (single model) | Nanjing University | 74.7 | 70.0 | 73.3 |
| 32 | Feb 01, 2019 | XyzNet (single model) | Beijing Normal University | 74.3 | 68.8 | 72.7 |
| 33 | Dec 30, 2018 | DrQA + marker features (single model) | Stanford University | 71.6 | 65.1 | 69.7 |
| 34 | Dec 10, 2018 | BiDAF++ (single model) | Beijing University of Posts and Telecommunications | 71.1 | 65.5 | 69.5 |
| 35 | Sep 27, 2018 | BiDAF++ (single model) | Allen Institute for Artificial Intelligence, https://arxiv.org/abs/1809.10735 | 69.4 | 63.8 | 67.8 |
| 36 | Nov 22, 2018 | Bert Base Augmented (single model) | Fudan University NLP Lab | 68.4 | 61.8 | 66.5 |
| 37 | Dec 18, 2018 | RNet_DotAtt + seq2seq with copy attention (single model) | University of Science and Technology of China | 68.1 | 62.3 | 66.4 |
| 38 | Dec 30, 2018 | Simplified BiDAF++ (single model) | Peking University | 68.7 | 60.5 | 66.3 |
| 39 | Aug 21, 2018 | DrQA + seq2seq with copy attention (single model) | Stanford University, https://arxiv.org/abs/1808.07042 | 67.0 | 60.4 | 65.1 |
| 40 | Aug 21, 2018 | Vanilla DrQA (single model) | Stanford University, https://arxiv.org/abs/1808.07042 | 54.5 | 47.9 | 52.6 |