A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents

What is SpokenWOZ?

SpokenWOZ is a large-scale multi-domain speech-text dataset for spoken task-oriented dialogue (TOD) modeling. It consists of 203k turns, 5.7k dialogues, and 249 hours of audio from realistic human-to-human spoken conversations.

Why SpokenWOZ?

The majority of existing TOD datasets are constructed by having annotators write or paraphrase dialogues rather than by collecting realistic spoken conversations. These written TOD datasets may not be representative of how people naturally speak in real-world conversations, which makes it difficult to train and evaluate models specifically designed for spoken TOD. Robustness issues, such as ASR noise, also cannot be fully explored with written TOD datasets. Unlike existing spoken TOD datasets, SpokenWOZ captures common spoken characteristics, such as word-by-word processing and commonsense knowledge in spoken language. SpokenWOZ also introduces cross-turn slot detection and reasoning slot detection as new challenges to better handle these spoken characteristics. For more details, please see the SpokenWOZ Paper.

Getting Started

The data is split into training, dev, and test sets. Download the dataset here (distributed under the CC BY-NC 4.0 license):

SpokenWOZ Audio Training & Dev Sets
SpokenWOZ Text Training & Dev Sets
SpokenWOZ Audio Test Set
SpokenWOZ Text Test Set
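Once downloaded, the text annotations can be inspected directly. Below is a minimal loading sketch in Python, assuming the text release follows a MultiWOZ-style layout (one JSON file of dialogues plus a file listing the dev dialogue IDs); the file names here are illustrative, so please check the release and the evaluation scripts for the exact structure.

```python
# Minimal loading sketch, assuming a MultiWOZ-style layout.
# File names below are illustrative, not guaranteed by the release.
import json

with open("data.json") as f:           # all training/dev dialogues, keyed by dialogue ID
    dialogues = json.load(f)
with open("valListFile.json") as f:    # one dev dialogue ID per line
    dev_ids = {line.strip() for line in f if line.strip()}

train = {k: v for k, v in dialogues.items() if k not in dev_ids}
dev = {k: v for k, v in dialogues.items() if k in dev_ids}
print(f"{len(train)} training dialogues, {len(dev)} dev dialogues")
```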

Details of the baseline models and the evaluation script can be found on GitHub: SpokenWOZ Github Page
We will update the models and results on the leaderboard based on publicly available papers. Feel free to contact Shuzheng Si if you want to submit your results.

How Do We Construct SpokenWOZ?

Data collection consists of (1) the collection of dialogue audio and (2) the annotation of dialogues. Strict quality control is performed at each stage. More details can be found in our paper.

Have Questions or Want to Contribute?

Feel free to contact Shuzheng Si, Wentao Ma, and Haoyu Gao. We would greatly appreciate any suggestions you have for improving this project.

Acknowledgement

We would like to thank Prof. Milica Gasic for her appreciation of and advice on our idea at the beginning of the project. We also thank Dr. Bowen Yu for his constructive comments on our writing. Finally, we would like to thank all the annotators for their efforts. We look forward to our dataset advancing research on spoken TOD.

Citation

@inproceedings{NEURIPS2023_7b16688a,
  author    = {Si, Shuzheng and Ma, Wentao and Gao, Haoyu and Wu, Yuchuan and Lin, Ting-En and Dai, Yinpei and Li, Hangyu and Yan, Rui and Huang, Fei and Li, Yongbin},
  booktitle = {Advances in Neural Information Processing Systems},
  editor    = {A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
  pages     = {39088--39118},
  publisher = {Curran Associates, Inc.},
  title     = {SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents},
  url       = {https://proceedings.neurips.cc/paper_files/paper/2023/file/7b16688a2b053a1b01474ab5c78ce662-Paper-Datasets_and_Benchmarks.pdf},
  volume    = {36},
  year      = {2023}
}

Leaderboard - Dialogue State Tracking

We use joint goal accuracy (JGA) to evaluate the DST task; it measures the ratio of dialogue turns for which the value of every slot is correctly predicted. The challenges of DST in spoken dialogue include robustness to noisy ASR transcripts, cross-turn slots, and reasoning slots.
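As a concrete illustration, the sketch below computes JGA over per-turn dialogue states represented as slot-value dicts. It is a simplified stand-in for the official evaluation script on the GitHub page, and the slot names are invented for the example.

```python
# Simplified JGA sketch: a turn counts as correct only if the
# predicted state matches the gold state on every slot.

def joint_goal_accuracy(predicted, gold):
    assert len(predicted) == len(gold)
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Hypothetical example: the second turn gets one slot wrong,
# so only 1 of 2 turns is fully correct -> JGA = 0.5.
pred = [{"hotel-area": "north"},
        {"hotel-area": "north", "hotel-stars": "4"}]
gold = [{"hotel-area": "north"},
        {"hotel-area": "north", "hotel-stars": "5"}]
print(joint_goal_accuracy(pred, gold))  # 0.5
```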

Rank | Date         | Model              | Affiliation               | Reference             | JGA
---- | ------------ | ------------------ | ------------------------- | --------------------- | -----
1    | June 1, 2023 | SPACE+WavLMalign   | Alibaba DAMO              | (Si et al., 2023)     | 25.65
2    | June 1, 2023 | SPACE+WavLM        | Alibaba DAMO              | (Si et al., 2023)     | 24.09
3    | June 1, 2023 | SPACE              | Alibaba DAMO              | (He et al., 2022)     | 22.73
4    | June 1, 2023 | UBAR               | Sun Yat-sen University    | (Yang et al., 2021)   | 20.54
5    | June 1, 2023 | SPACE+WavLM+TripPy | Alibaba DAMO              | (Si et al., 2023)     | 18.71
6    | June 1, 2023 | SPACE+TripPy       | Alibaba DAMO              | (He et al., 2022)     | 16.24
7    | June 1, 2023 | BERT+TripPy        | Heinrich Heine University | (Heck et al., 2020)   | 14.78
8    | June 1, 2023 | InstructGPT003     | OpenAI                    | (Ouyang et al., 2022) | 14.15
9    | June 1, 2023 | ChatGPT            | OpenAI                    | (Ouyang et al., 2022) | 13.75

Leaderboard - Response Generation

The next challenge is response generation, including policy optimization and end-to-end modeling. We use INFORM, SUCCESS, BLEU, and Combined Score to evaluate the generated responses.
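For reference, the Combined Score reported below follows the standard MultiWOZ convention, Combined = (INFORM + SUCCESS) / 2 + BLEU, which the leaderboard numbers are consistent with. Here is a short sketch:

```python
# Standard MultiWOZ-style combined score: (INFORM + SUCCESS) / 2 + BLEU.

def combined_score(inform: float, success: float, bleu: float) -> float:
    return (inform + success) / 2 + bleu

# Matches the top end-to-end entry: (68.30 + 52.10) / 2 + 22.12 = 82.32
print(round(combined_score(68.30, 52.10, 22.12), 2))  # 82.32
```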

End-to-end Modeling

In the end-to-end modeling task, the dialogue system is expected to produce the correct response given only the user's utterances.

Rank | Date         | Model            | Affiliation            | Reference             | INFORM | SUCCESS | BLEU  | Comb
---- | ------------ | ---------------- | ---------------------- | --------------------- | ------ | ------- | ----- | -----
1    | June 1, 2023 | SPACE+WavLMalign | Alibaba DAMO           | (Si et al., 2023)     | 68.30  | 52.10   | 22.12 | 82.32
2    | June 1, 2023 | SPACE+WavLM      | Alibaba DAMO           | (Si et al., 2023)     | 67.20  | 51.30   | 21.46 | 80.71
3    | June 1, 2023 | SPACE            | Alibaba DAMO           | (He et al., 2022)     | 66.40  | 50.60   | 21.34 | 79.84
4    | June 1, 2023 | GALAXY           | Alibaba DAMO           | (He et al., 2022)     | 65.80  | 38.50   | 20.10 | 72.25
5    | June 1, 2023 | UBAR             | Sun Yat-sen University | (Yang et al., 2021)   | 60.20  | 47.40   | 9.90  | 63.70
6    | June 1, 2023 | InstructGPT003   | OpenAI                 | (Ouyang et al., 2022) | 25.30  | 18.50   | 6.13  | 28.03
7    | June 1, 2023 | ChatGPT          | OpenAI                 | (Ouyang et al., 2022) | 23.40  | 13.80   | 3.59  | 22.19

Policy Optimization

In the policy optimization task, the dialogue system is given the utterances and the dialogue state, and is expected to learn the conversation policy and generate high-quality responses.

Rank | Date         | Model            | Affiliation            | Reference             | INFORM | SUCCESS | BLEU  | Comb
---- | ------------ | ---------------- | ---------------------- | --------------------- | ------ | ------- | ----- | -----
1    | June 1, 2023 | SPACE+WavLMalign | Alibaba DAMO           | (Si et al., 2023)     | 77.20  | 59.20   | 19.81 | 88.01
2    | June 1, 2023 | SPACE+WavLM      | Alibaba DAMO           | (Si et al., 2023)     | 76.80  | 58.40   | 18.54 | 86.14
3    | June 1, 2023 | SPACE            | Alibaba DAMO           | (He et al., 2022)     | 76.00  | 57.60   | 18.72 | 85.52
4    | June 1, 2023 | InstructGPT003   | OpenAI                 | (Ouyang et al., 2022) | 78.20  | 56.90   | 7.72  | 75.27
5    | June 1, 2023 | GALAXY           | Alibaba DAMO           | (He et al., 2022)     | 70.60  | 42.20   | 16.52 | 72.92
6    | June 1, 2023 | UBAR             | Sun Yat-sen University | (Yang et al., 2021)   | 62.50  | 48.10   | 9.69  | 64.99
7    | June 1, 2023 | ChatGPT          | OpenAI                 | (Ouyang et al., 2022) | 73.40  | 39.50   | 4.58  | 61.03