Incorporating Visual-Linguistic Features into Scientific Document Summarization

Since the emergence of COVID-19, tens of thousands of research papers have been published, especially in scientific domains such as medicine, making it difficult to keep track of such rapid progress by manual effort alone. This unprecedented urgency presents a pressing need to advance natural language processing (NLP) techniques for scientific document summarization, so that scientists can quickly grasp the main insights in the literature. Scientific papers are visually rich documents: they convey not only text but also visual-linguistic content such as figures and typography, which carries crucial information. For example, figures can complement a textual summary, and typographic cues such as bullet layouts can signal research highlights. Summarizing scientific literature therefore calls for a multimodal model that combines text, figures, and typography. However, scientific NLP has yet to catch up with this trend, and state-of-the-art summarizers remain mostly text-only. This motivates the objective of this proposal: to investigate the use of visual-linguistic information for scientific summarization.

Visual-linguistic information has proven useful for text understanding tasks; we argue that it is also useful for text generation. Specifically, we will design a summarization framework that effectively connects text, figures, and typography, and that exploits novel fusion mechanisms to aggregate them at the spatial (e.g., text-figure ratio), structural (e.g., sibling text and figures), and semantic (e.g., text-figure complementarity) levels, as sketched below. To further equip our summarizer with scientific domain knowledge, we will design a set of multimodal pre-training tasks that not only expose the summarizer to the unique properties of scientific documents (e.g., section-wise content) but also teach it the cross-modal correspondence between modalities (e.g., generating a multimodal summary with awareness of the text-figure ratio in each paper section). Finally, we will develop a benchmark, namely Paper2Poster, to further evaluate our summarizer's potential for advancing practical scientific summarization tasks. Paper2Poster challenges summarizers not only to abstract the visual and textual content of the source papers, but also to present that content in a visually structured layout for poster presentation (e.g., a zig-zag layout). Accordingly, we will extend our summarizer with a new probabilistic generative model that supports content-aware layout generation based on the semantics of the summarized content.

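To make the three fusion levels concrete, the following minimal sketch illustrates one possible way a fusion module could combine text and figure representations; the module names, dimensions, and input signals (e.g., a per-section text-figure ratio and a text-figure sibling indicator) are illustrative assumptions for this proposal stage, not the actual architecture.

```python
# Minimal, illustrative sketch of spatial/structural/semantic fusion of text
# and figure representations. All names, dimensions, and inputs are assumed.
import torch
import torch.nn as nn


class MultiLevelFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Spatial level: embed a scalar layout statistic such as the
        # per-section text-figure ratio.
        self.spatial_proj = nn.Linear(1, dim)
        # Structural level: embed whether a text block and a figure are
        # siblings under the same section.
        self.sibling_embed = nn.Embedding(2, dim)
        # Semantic level: cross-attention from text tokens to figure regions
        # to model text-figure complementarity.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, text, figures, text_figure_ratio, is_sibling):
        # text:              (batch, n_text_tokens, dim) text encoder states
        # figures:           (batch, n_figure_regions, dim) figure encoder states
        # text_figure_ratio: (batch, 1) spatial statistic for the section
        # is_sibling:        (batch,) 1 if the text and figures share a section
        attended, _ = self.cross_attn(query=text, key=figures, value=figures)
        spatial = self.spatial_proj(text_figure_ratio).unsqueeze(1)
        structural = self.sibling_embed(is_sibling).unsqueeze(1)
        # Broadcast the spatial/structural signals over the text tokens and
        # combine them with the semantically attended figure information.
        return self.out_proj(text + attended + spatial + structural)


if __name__ == "__main__":
    fusion = MultiLevelFusion()
    text = torch.randn(2, 128, 768)
    figures = torch.randn(2, 16, 768)
    ratio = torch.rand(2, 1)
    sibling = torch.tensor([1, 0])
    print(fusion(text, figures, ratio, sibling).shape)  # torch.Size([2, 128, 768])
```

In the envisioned framework, such fused representations would feed the summarization decoder and the content-aware layout generator, and the same interface could be extended with typographic features.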

While this proposal aims to develop innovative multimodal solutions for scientific summarization, its research outcomes will also be valuable for applications beyond summarization that demand visual attractiveness and efficient communication of ideas, and will ultimately contribute to the development of intelligent machines with effective visual communication skills.