The SG-NLG dataset is a pre-processed version of the DSTC8 Schema-Guided Dialogue (SGD) dataset, designed specifically for data-to-text natural language generation (NLG). The original DSTC8 SGD dataset contains ~20,000 dialogues spanning ~20 domains.
SG-NLG is intended to make it easier to conduct NLG experiments on the SGD data. We pre-process SGD by pairing the schema for each system turn with the corresponding set of natural language strings that realize it. We also “delexicalize” the prompts (replace specific slot values with fixed placeholder names) to convert them into templates, which makes them more generic for use within a dialogue system.
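To illustrate the delexicalization idea described above, here is a minimal sketch in Python. It is not the pre-processing code used to build SG-NLG: the function name `delexicalize`, the `$slot_name` placeholder format, and the example utterance and slot names are all assumptions made for illustration. The real procedure is described in the paper.

```python
import re

def delexicalize(utterance, slot_values):
    """Replace each slot value in `utterance` with a placeholder named after its slot.

    `slot_values` maps slot names to the surface strings that realize them.
    The "$slot_name" placeholder format is hypothetical, not the dataset's own.
    """
    template = utterance
    # Replace longer values first so a short value cannot clobber part of a longer one.
    ordered = sorted(slot_values.items(), key=lambda kv: len(kv[1]), reverse=True)
    for slot, value in ordered:
        placeholder = "$" + slot
        # A function replacement avoids any special handling of the placeholder text.
        template = re.sub(
            re.escape(value), lambda _m: placeholder, template, flags=re.IGNORECASE
        )
    return template

# Hypothetical system prompt and slot values, for illustration only.
utterance = "I found a table at Opa for 7 pm."
slots = {"restaurant_name": "Opa", "time": "7 pm"}
template = delexicalize(utterance, slots)
print(template)
```

This sketch uses plain substring matching, so a value that also occurs inside an unrelated word would be replaced as well. A more careful version would rely on annotated character spans where they are available.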
The final SG-NLG dataset comprises nearly 4K meaning representations (MRs) and over 140K templates. Full details on the pre-processing step are given in the paper (see the Citation section). Note that we use only the train and dev splits of the original DSTC8 SGD dataset, because at the time of creation the SGD test set did not yet include user intents, which our pre-processing requires. We therefore use SGD train to create our train and dev sets (split 90% train and 10% dev) and use SGD dev as our test set.
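A 90%/10% train/dev split like the one described above could be sketched as follows. The text does not specify how the split was drawn, so the random shuffle, the fixed seed, and the helper name `split_train_dev` are assumptions for illustration rather than the procedure actually used.

```python
import random

def split_train_dev(examples, dev_fraction=0.1, seed=0):
    """Shuffle `examples` deterministically and hold out `dev_fraction` as dev."""
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_dev = int(round(len(shuffled) * dev_fraction))
    # The first n_dev shuffled items form dev; the remainder forms train.
    return shuffled[n_dev:], shuffled[:n_dev]

# Hypothetical stand-ins for examples derived from the SGD train split.
examples = [f"example_{i}" for i in range(100)]
train, dev = split_train_dev(examples)
print(len(train), len(dev))
```

Fixing the seed keeps the split reproducible across runs, which matters when results on the dev set are compared between experiments.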