
CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. CoQA is pronounced "coca".
CoQA contains 127,000+ questions with answers, collected from 8,000+ conversations. Each conversation is collected by pairing two crowdworkers to chat about a passage in the form of questions and answers. The unique features of CoQA include: 1) the questions are conversational; 2) the answers can be free-form text; 3) each answer also comes with an evidence subsequence highlighted in the passage; and 4) the passages are collected from seven diverse domains. CoQA exhibits many challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.
Browse the examples in CoQA:
Download a copy of the dataset in JSON format:
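To get a feel for the data, here is a minimal Python sketch for inspecting one conversation. It assumes the released JSON layout (a top-level `data` list whose entries carry a `story`, parallel `questions` and `answers` lists, and per-answer rationale spans) and a local file named `coqa-dev-v1.0.json`; check your downloaded copy if the field names differ.

```python
import json

# Load the dev set; the path is an assumption -- point this at your download.
with open("coqa-dev-v1.0.json") as f:
    dataset = json.load(f)

# Each entry in "data" is one passage plus its full conversation.
conv = dataset["data"][0]
print(conv["source"])       # domain the passage was drawn from
print(conv["story"][:200])  # the passage text itself

# Questions and answers are parallel lists, ordered by conversation turn.
# Each answer carries character offsets into the story for its evidence span.
for q, a in zip(conv["questions"], conv["answers"]):
    rationale = conv["story"][a["span_start"]:a["span_end"]]
    print(f'Turn {q["turn_id"]}: {q["input_text"]}')
    print(f'  answer:    {a["input_text"]}')
    print(f'  rationale: {rationale.strip()}')
```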
To evaluate your models, use the official evaluation script. To run the evaluation, use `python evaluate-v1.0.py --data-file <path_to_dev-v1.0.json> --pred-file <path_to_predictions>`.
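To our understanding, the predictions file is a JSON list with one entry per question turn, each carrying the conversation `id`, the `turn_id`, and the predicted answer string; verify against `evaluate-v1.0.py` if in doubt. A minimal sketch for writing such a file:

```python
import json

# Hypothetical predictions keyed by (conversation id, turn id); in practice
# these come from your model, and the ids must match the "id" fields in the
# data file. The ids below are placeholders.
predictions = {
    ("example-conversation-id", 1): "white",
    ("example-conversation-id", 2): "in a barn",
}

# Flatten into the list-of-dicts layout the official script reads.
pred_list = [
    {"id": conv_id, "turn_id": turn_id, "answer": answer}
    for (conv_id, turn_id), answer in predictions.items()
]

with open("predictions.json", "w") as f:
    json.dump(pred_list, f)
```

Passing `predictions.json` as `--pred-file` to the command above should then produce the F1 scores reported on the leaderboard.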
Once you are satisfied with your model's performance on the dev set, you can submit it to get official scores on the test sets. We have two test sets: an in-domain set covering the domains present in the training and dev sets, and an out-of-domain set covering unseen domains (see the paper for more details). To preserve the integrity of the test results, we do not release the test set to the public. Follow this tutorial on how to submit your model for an official evaluation:
Submission Tutorial

CoQA contains passages from seven domains. We make five of these public under the following licenses:
Ask us questions at our Google group or at siva.reddy@mila.quebec or danqic@cs.princeton.edu.
We thank the SQuAD team for allowing us to use their code and templates for generating this website.
| Rank | Date | Model | In-domain (F1) | Out-of-domain (F1) | Overall (F1) |
|---|---|---|---|---|---|
| | | Human Performance, Stanford University (Reddy & Chen et al., TACL '19) | 89.4 | 87.4 | 88.8 |
| 1 | Sep 05, 2019 | RoBERTa + AT + KD (ensemble), Zhuiyi Technology (https://arxiv.org/abs/1909.10772) | 91.4 | 89.2 | 90.7 |
| 1 | Apr 22, 2020 | TR-MT (ensemble), WeChatAI | 91.5 | 88.8 | 90.7 |
| 2 | Sep 05, 2019 | RoBERTa + AT + KD (single model), Zhuiyi Technology (https://arxiv.org/abs/1909.10772) | 90.9 | 89.2 | 90.4 |
| 3 | Jan 01, 2020 | TR-MT (ensemble), WeChatAI | 91.1 | 87.9 | 90.2 |
| 4 | Mar 29, 2019 | Google SQuAD 2.0 + MMFT (ensemble), MSRA + SDRG | 89.9 | 88.0 | 89.4 |
| 5 | Dec 18, 2019 | TR-MT (single model), WeChatAI | 90.4 | 86.8 | 89.3 |
| 6 | Sep 13, 2019 | XLNet + Augmentation (single model), Xiaoming (https://github.com/stevezheng23/xlnet_extension_tf) | 89.9 | 86.9 | 89.0 |
| 7 | Mar 29, 2019 | Google SQuAD 2.0 + MMFT (single model), MSRA + SDRG | 88.5 | 86.0 | 87.8 |
| 7 | Mar 29, 2019 | ConvBERT (ensemble), Joint Laboratory of HIT and iFLYTEK Research | 88.7 | 85.4 | 87.8 |
| 8 | Jan 25, 2019 | BERT + MMFT + ADA (ensemble), Microsoft Research Asia | 87.5 | 85.3 | 86.8 |
| 8 | Mar 28, 2019 | ConvBERT (single model), Joint Laboratory of HIT and iFLYTEK Research | 87.7 | 84.6 | 86.8 |
| 9 | Jan 21, 2019 | BERT + MMFT + ADA (single model), Microsoft Research Asia | 86.4 | 81.9 | 85.0 |
| 10 | Apr 28, 2020 | XLNet + MMFT + ADA (single model), NEUKG | 85.7 | 81.7 | 84.6 |
| 11 | Aug 26, 2019 | BERT + AttentionFusionNet (single model), Beijing Kingsoft AI Lab | 85.4 | 77.3 | 83.0 |
| 12 | Jan 03, 2019 | BERT + Answer Verification (single model), Sogou Search AI Group (https://github.com/sogou/SMRCToolkit) | 83.8 | 80.2 | 82.8 |
| 13 | Jan 06, 2019 | BERT with History Augmented Query (single model), Fudan University NLP Lab | 82.7 | 78.6 | 81.5 |
| 14 | Feb 01, 2019 | BERT Large Finetuned Baseline (single model), Anonymous | 82.6 | 78.4 | 81.4 |
| 15 | Jan 21, 2019 | BERT Large Augmented (single model), Microsoft Dynamics 365 AI Research | 82.5 | 77.6 | 81.1 |
| 16 | Dec 13, 2018 | D-AoA + BERT (single model), Joint Laboratory of HIT and iFLYTEK Research | 81.4 | 77.3 | 80.2 |
| 17 | Aug 01, 2019 | BERT Augmented + AoA (single model), Netease Games AI Lab | 81.1 | 77.4 | 80.0 |
| 18 | Mar 10, 2019 | CNet (single model), Anonymous | 80.9 | 77.1 | 79.8 |
| 19 | Nov 29, 2018 | SDNet (ensemble), Microsoft Speech and Dialogue Research Group (https://github.com/Microsoft/SDNet) | 80.7 | 75.9 | 79.3 |
| 20 | Feb 22, 2019 | CQANet (single model), Nanjing University | 80.2 | 76.5 | 79.1 |
| 21 | May 09, 2019 | CANet (single model), Northwestern Polytechnical University | 80.1 | 75.7 | 78.9 |
| 22 | Apr 14, 2019 | BERT w/ 2-context (single model), NTT Media Intelligence Laboratories (https://arxiv.org/pdf/1905.12848) | 79.8 | 75.9 | 78.7 |
| 22 | Jul 14, 2019 | BERT Finetuned Baseline (single model) | 79.7 | 76.3 | 78.7 |
| 23 | May 06, 2020 | Bert-MultiChannelFlow (single model), SIAT-NLP | 79.4 | 75.3 | 78.2 |
| 24 | Dec 30, 2018 | BERT-base finetune (single model), Tsinghua University CoAI Lab | 79.8 | 74.1 | 78.1 |
| 25 | Apr 19, 2019 | Bert-FlowDelta (single model), National Taiwan University, MiuLab (https://arxiv.org/abs/1908.05117) | 79.2 | 74.1 | 77.7 |
| 26 | Feb 28, 2019 | GraphFlow (single model), RPI and IBM Research (https://arxiv.org/pdf/1908.00059.pdf) | 78.4 | 74.5 | 77.3 |
| 27 | Nov 26, 2018 | SDNet (single model), Microsoft Speech and Dialogue Research Group (https://github.com/Microsoft/SDNet) | 78.0 | 73.1 | 76.6 |
| 28 | Aug 29, 2019 | Flow Framework (single model), SIAT NLP Group | 77.0 | 73.1 | 75.8 |
| 29 | Oct 06, 2018 | FlowQA (single model), Allen Institute for Artificial Intelligence (https://arxiv.org/abs/1810.06683) | 76.3 | 71.8 | 75.0 |
| 30 | Jul 17, 2019 | HisFurC + BERT (single model) | 76.0 | 70.4 | 74.4 |
| 31 | Jan 14, 2019 | RNet + PGNet + BERT (single model), Nanjing University | 74.7 | 70.0 | 73.3 |
| 32 | Feb 01, 2019 | XyzNet (single model), Beijing Normal University | 74.3 | 68.8 | 72.7 |
| 33 | Dec 30, 2018 | DrQA + marker features (single model), Stanford University | 71.6 | 65.1 | 69.7 |
| 34 | Dec 10, 2018 | BiDAF++ (single model), Beijing University of Posts and Telecommunications | 71.1 | 65.5 | 69.5 |
| 35 | Sep 27, 2018 | BiDAF++ (single model), Allen Institute for Artificial Intelligence (https://arxiv.org/abs/1809.10735) | 69.4 | 63.8 | 67.8 |
| 36 | Nov 22, 2018 | Bert Base Augmented (single model), Fudan University NLP Lab | 68.4 | 61.8 | 66.5 |
| 37 | Dec 18, 2018 | RNet_DotAtt + seq2seq with copy attention (single model), University of Science and Technology of China | 68.1 | 62.3 | 66.4 |
| 38 | Dec 30, 2018 | Simplified BiDAF++ (single model), Peking University | 68.7 | 60.5 | 66.3 |
| 39 | Aug 21, 2018 | DrQA + seq2seq with copy attention (single model), Stanford University (https://arxiv.org/abs/1808.07042) | 67.0 | 60.4 | 65.1 |
| 40 | Aug 21, 2018 | Vanilla DrQA (single model), Stanford University (https://arxiv.org/abs/1808.07042) | 54.5 | 47.9 | 52.6 |