Ondřej Dušek presents the 52nd meeting of the Prague Computer Science Seminar
On 9 February 2023 at 16:15 in KN:E-107, Karlovo náměstí 13, Praha 2
Robust Data-to-Text Generation with Pretrained Language Models
The task of data-to-text generation amounts to describing structured data in fluent natural language sentences. The state-of-the-art approach in research systems today is finetuning pretrained neural language models (PLMs). This often leads to overfitting and hallucinations, i.e. situations where the PLM generates outputs that are not grounded in the input, replicating or amplifying training data noise. Rather than applying a PLM as a black box to the whole data-to-text task, we use PLMs for simple subtasks, aiming to achieve broad generalization and minimize hallucination.
First, we use a pipeline approach where the PLMs only work as text “editors” rather than generators, taking advantage of their high output fluency. The data is converted into text in an initial preprocessing step, using simple handcrafted templates that verbalize the individual input facts (i.e. relations between entities). The PLMs then order the facts and fuse them into fluent sentences. This lets us generate without in-domain training data while achieving good fluency and accuracy.
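As a minimal illustration of the template step (not the system from the talk; the relation names and templates below are invented), a few lines of Python can turn input facts into the simple sentences that a PLM would then order and fuse:

# Minimal sketch: verbalize (subject, relation, object) facts with handcrafted
# templates. The resulting "fact sentences" are what a PLM would subsequently
# order and fuse into a fluent paragraph.

# Hypothetical templates keyed by relation name; a real system needs one per relation.
TEMPLATES = {
    "birthPlace": "{subject} was born in {object}.",
    "occupation": "{subject} works as {object}.",
}

def verbalize(facts):
    """Turn (subject, relation, object) triples into simple template sentences."""
    sentences = []
    for subject, relation, obj in facts:
        # Fall back to a generic pattern for relations without a handcrafted template.
        template = TEMPLATES.get(relation, "{subject} {relation} {object}.")
        sentences.append(template.format(subject=subject, relation=relation, object=obj))
    return sentences

if __name__ == "__main__":
    facts = [
        ("Karel Čapek", "birthPlace", "Malé Svatoňovice"),
        ("Karel Čapek", "occupation", "a writer"),
    ]
    # Prints two simple sentences; a finetuned PLM would reorder and fuse them
    # into a single fluent passage.
    print(" ".join(verbalize(facts)))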
We further examine the capability of PLMs to produce accurate descriptions of individual facts from the data, in order to remove this last handcrafted step. Using a specially collected dataset, we show that PLMs finetuned to describe a variety of relations are very robust in verbalizing novel, unseen relations. The key to the PLMs’ usability here is providing clear relation names in the input.
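The sketch below shows, very roughly, how a single fact with an explicit, human-readable relation name can be fed to a pretrained seq2seq model via the Hugging Face transformers library. The model name and input format are placeholders rather than the setup used in the talk, and an off-the-shelf checkpoint would still need finetuning on relation descriptions before its outputs become reliable; the point is only that the relation is passed as a clear name, not an opaque identifier:

# Minimal sketch, assuming the Hugging Face transformers library; the model
# name and input format are illustrative placeholders, not the talk's setup.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # placeholder; any seq2seq PLM could be finetuned for this task
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A clear relation name ("birth place" rather than an opaque ID such as "P19")
# is what the abstract identifies as the key to verbalizing unseen relations.
fact = "subject: Karel Čapek | relation: birth place | object: Malé Svatoňovice"

inputs = tokenizer(fact, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))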
https://www.praguecomputerscience.cz/