CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. CoQA is pronounced as "coca".
CoQA contains 127,000+ questions with answers collected from 8,000+ conversations. Each conversation is collected by pairing two crowdworkers to chat about a passage in the form of questions and answers. The unique features of CoQA include: 1) the questions are conversational; 2) the answers can be free-form text; 3) each answer also comes with an evidence subsequence highlighted in the passage; and 4) the passages are collected from seven diverse domains. CoQA exhibits many challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning. See the CoQA paper for details.
Browse the examples in CoQA.
Download a copy of the dataset in JSON format.
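The dataset is plain JSON, so it is easy to inspect before building a model. The snippet below is a minimal sketch assuming the v1.0 schema of the released files (a top-level `data` list of conversations, each with a `story` passage and parallel `questions`/`answers` lists); the file name is an assumption, so adjust it to wherever you saved your copy.

```python
import json

# Minimal sketch of inspecting the CoQA JSON. Assumes the v1.0 schema:
# top-level "data" is a list of conversations, each with a passage ("story"),
# parallel "questions"/"answers" lists, and per-answer evidence spans.
with open("coqa-dev-v1.0.json") as f:   # adjust the path to your copy
    dataset = json.load(f)

conv = dataset["data"][0]
print(conv["story"][:200], "...")       # the passage the workers chatted about
for q, a in zip(conv["questions"], conv["answers"]):
    print(f"Q{q['turn_id']}: {q['input_text']}")
    print(f"A{a['turn_id']}: {a['input_text']} (evidence: {a['span_text']!r})")
```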
To evaluate your models, use the official evaluation script. To run the evaluation, use `python evaluate-v1.0.py --data-file <path_to_dev-v1.0.json> --pred-file <path_to_predictions>`.
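The script compares a predictions file against the gold answers in the dev file. As a rough sketch (check the evaluation script itself for the authoritative format), predictions are a JSON list with one entry per question turn, keyed by the conversation `id` and the question `turn_id`; `my_model_answer` below is a hypothetical stand-in for your system:

```python
import json

def my_model_answer(story, question):
    # Hypothetical stand-in: replace with your model's prediction.
    return "unknown"

with open("coqa-dev-v1.0.json") as f:   # adjust the path to your copy
    dataset = json.load(f)

# One prediction per question turn; the (id, turn_id, answer) keys are
# assumed to mirror what the evaluation script matches against gold answers.
predictions = []
for conv in dataset["data"]:
    for q in conv["questions"]:
        predictions.append({
            "id": conv["id"],
            "turn_id": q["turn_id"],
            "answer": my_model_answer(conv["story"], q["input_text"]),
        })

with open("predictions.json", "w") as f:
    json.dump(predictions, f)
```

You can then score this file with the command above, e.g. `python evaluate-v1.0.py --data-file coqa-dev-v1.0.json --pred-file predictions.json`.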
Once you are satisfied with your model's performance on the dev set, you can submit it to get official scores on the test sets. We have two test sets: an in-domain set, which contains the same domains as the training and dev sets, and an out-of-domain set, which contains unseen domains (see the paper for more details). To preserve the integrity of the test results, we do not release the test sets to the public. Follow this tutorial on how to submit your model for an official evaluation:
Submission Tutorial

CoQA contains passages from seven domains. We make five of these public under their respective licenses.
Ask us questions at our Google group or at siva.reddy@mila.quebec or danqic@cs.princeton.edu.
We thank the SQuAD team for allowing us to use their code and templates for generating this website.
| Rank | Date | Model | Institution | In-domain F1 | Out-of-domain F1 | Overall F1 |
|---|---|---|---|---|---|---|
| | | Human Performance | Stanford University (Reddy & Chen et al., TACL '19) | 89.4 | 87.4 | 88.8 |
| 1 | Sep 05, 2019 | RoBERTa + AT + KD (ensemble) | Zhuiyi Technology, https://arxiv.org/abs/1909.10772 | 91.4 | 89.2 | 90.7 |
| 1 | Apr 22, 2020 | TR-MT (ensemble) | WeChatAI | 91.5 | 88.8 | 90.7 |
| 2 | Sep 05, 2019 | RoBERTa + AT + KD (single model) | Zhuiyi Technology, https://arxiv.org/abs/1909.10772 | 90.9 | 89.2 | 90.4 |
| 3 | Jan 01, 2020 | TR-MT (ensemble) | WeChatAI | 91.1 | 87.9 | 90.2 |
| 4 | Mar 29, 2019 | Google SQuAD 2.0 + MMFT (ensemble) | MSRA + SDRG | 89.9 | 88.0 | 89.4 |
| 5 | Dec 18, 2019 | TR-MT (single model) | WeChatAI | 90.4 | 86.8 | 89.3 |
| 6 | Sep 13, 2019 | XLNet + Augmentation (single model) | Xiaoming, https://github.com/stevezheng23/xlnet_extension_tf | 89.9 | 86.9 | 89.0 |
| 7 | Mar 29, 2019 | Google SQuAD 2.0 + MMFT (single model) | MSRA + SDRG | 88.5 | 86.0 | 87.8 |
| 7 | Mar 29, 2019 | ConvBERT (ensemble) | Joint Laboratory of HIT and iFLYTEK Research | 88.7 | 85.4 | 87.8 |
| 8 | Jan 25, 2019 | BERT + MMFT + ADA (ensemble) | Microsoft Research Asia | 87.5 | 85.3 | 86.8 |
| 8 | Mar 28, 2019 | ConvBERT (single model) | Joint Laboratory of HIT and iFLYTEK Research | 87.7 | 84.6 | 86.8 |
| 9 | Jan 21, 2019 | BERT + MMFT + ADA (single model) | Microsoft Research Asia | 86.4 | 81.9 | 85.0 |
| 10 | Apr 28, 2020 | XLNet + MMFT + ADA (single model) | NEUKG | 85.7 | 81.7 | 84.6 |
| 11 | Aug 26, 2019 | BERT + AttentionFusionNet (single model) | Beijing Kingsoft AI Lab | 85.4 | 77.3 | 83.0 |
| 12 | Jan 03, 2019 | BERT + Answer Verification (single model) | Sogou Search AI Group, https://github.com/sogou/SMRCToolkit | 83.8 | 80.2 | 82.8 |
| 13 | Jan 06, 2019 | BERT with History Augmented Query (single model) | Fudan University NLP Lab | 82.7 | 78.6 | 81.5 |
| 14 | Feb 01, 2019 | BERT Large Finetuned Baseline (single model) | Anonymous | 82.6 | 78.4 | 81.4 |
| 15 | Jan 21, 2019 | BERT Large Augmented (single model) | Microsoft Dynamics 365 AI Research | 82.5 | 77.6 | 81.1 |
| 16 | Dec 13, 2018 | D-AoA + BERT (single model) | Joint Laboratory of HIT and iFLYTEK Research | 81.4 | 77.3 | 80.2 |
| 17 | Aug 01, 2019 | BERT Augmented + AoA (single model) | Netease Games AI Lab | 81.1 | 77.4 | 80.0 |
| 18 | Mar 10, 2019 | CNet (single model) | Anonymous | 80.9 | 77.1 | 79.8 |
| 19 | Nov 29, 2018 | SDNet (ensemble) | Microsoft Speech and Dialogue Research Group, https://github.com/Microsoft/SDNet | 80.7 | 75.9 | 79.3 |
| 20 | Feb 22, 2019 | CQANet (single model) | Nanjing University | 80.2 | 76.5 | 79.1 |
| 21 | May 09, 2019 | CANet (single model) | Northwestern Polytechnical University | 80.1 | 75.7 | 78.9 |
| 22 | Apr 14, 2019 | BERT w/ 2-context (single model) | NTT Media Intelligence Laboratories, https://arxiv.org/pdf/1905.12848 | 79.8 | 75.9 | 78.7 |
| 22 | Jul 14, 2019 | BERT Finetuned Baseline (single model) | | 79.7 | 76.3 | 78.7 |
| 23 | May 06, 2020 | Bert-MultiChannelFlow (single model) | SIAT-NLP | 79.4 | 75.3 | 78.2 |
| 24 | Dec 30, 2018 | BERT-base finetune (single model) | Tsinghua University CoAI Lab | 79.8 | 74.1 | 78.1 |
| 25 | Apr 19, 2019 | Bert-FlowDelta (single model) | National Taiwan University MiuLab, https://arxiv.org/abs/1908.05117 | 79.2 | 74.1 | 77.7 |
| 26 | Feb 28, 2019 | GraphFlow (single model) | RPI and IBM Research, https://arxiv.org/pdf/1908.00059.pdf | 78.4 | 74.5 | 77.3 |
| 27 | Nov 26, 2018 | SDNet (single model) | Microsoft Speech and Dialogue Research Group, https://github.com/Microsoft/SDNet | 78.0 | 73.1 | 76.6 |
| 28 | Aug 29, 2019 | Flow Framework (single model) | SIAT NLP Group | 77.0 | 73.1 | 75.8 |
| 29 | Oct 06, 2018 | FlowQA (single model) | Allen Institute for Artificial Intelligence, https://arxiv.org/abs/1810.06683 | 76.3 | 71.8 | 75.0 |
| 30 | Jul 17, 2019 | HisFurC + BERT (single model) | | 76.0 | 70.4 | 74.4 |
| 31 | Jan 14, 2019 | RNet + PGNet + BERT (single model) | Nanjing University | 74.7 | 70.0 | 73.3 |
| 32 | Feb 01, 2019 | XyzNet (single model) | Beijing Normal University | 74.3 | 68.8 | 72.7 |
| 33 | Dec 30, 2018 | DrQA + marker features (single model) | Stanford University | 71.6 | 65.1 | 69.7 |
| 34 | Dec 10, 2018 | BiDAF++ (single model) | Beijing University of Posts and Telecommunications | 71.1 | 65.5 | 69.5 |
| 35 | Sep 27, 2018 | BiDAF++ (single model) | Allen Institute for Artificial Intelligence, https://arxiv.org/abs/1809.10735 | 69.4 | 63.8 | 67.8 |
| 36 | Nov 22, 2018 | Bert Base Augmented (single model) | Fudan University NLP Lab | 68.4 | 61.8 | 66.5 |
| 37 | Dec 18, 2018 | RNet_DotAtt + seq2seq with copy attention (single model) | University of Science and Technology of China | 68.1 | 62.3 | 66.4 |
| 38 | Dec 30, 2018 | Simplified BiDAF++ (single model) | Peking University | 68.7 | 60.5 | 66.3 |
| 39 | Aug 21, 2018 | DrQA + seq2seq with copy attention (single model) | Stanford University, https://arxiv.org/abs/1808.07042 | 67.0 | 60.4 | 65.1 |
| 40 | Aug 21, 2018 | Vanilla DrQA (single model) | Stanford University, https://arxiv.org/abs/1808.07042 | 54.5 | 47.9 | 52.6 |