Text Summarization of Japanese Documents Using NLP

October 7, 2021

Overview

The client is a leading Japan-based global provider of integrated solutions in printing, communications, security, packaging, décor materials, and electronics. They wanted to develop an AI that summarizes any Japanese document into a 3-sentence paragraph for an upcoming technology offering.

Key Features

  • Analyze Japanese documents
  • Identify key ideas
  • Summarize the document into a 3-sentence paragraph

Technology Stack

  • CRF
  • LSTM
  • BERT
  • T5
  • GPT2

Project Overview

  • Team: 2 people
  • Market: Japan

Background

Business people receive emails with hundred-page documents every day, which makes it impossible to read everything. Still, they need a grasp of the contents to know what is happening. The AI we are developing analyzes these documents, summarizes each one in 3 sentences, and sends the summary back to the user. That way, users can quickly understand a document and decide whether they need to read it in full.

Challenges

We wanted this AI to summarize text from different domains. However, building a separate traditional machine learning model for each domain would be labor-intensive. Therefore, we turned to Transfer Learning, which transfers knowledge learned in one context to another. Transfer Learning is more complex to apply in NLP than in computer vision. We had to use some of the latest language models, including T5 (Text-to-Text Transfer Transformer) by Google, BERT, and GPT2.

Meanwhile, Japanese is a complex language, which makes it more challenging to analyze. We need to distinguish the three Japanese scripts (hiragana, katakana, and kanji) and then analyze the different Parts of Speech (POS). Japanese text summarization AI is not new, but domain-specific topics require more complex solutions.
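
As a rough illustration of the script-identification step, the sketch below classifies characters by Unicode block and groups consecutive characters of the same script into runs. The function names (script_of, script_runs) are hypothetical and not taken from the project, and the ranges cover only the main Hiragana, Katakana, and CJK Unified Ideographs blocks. POS analysis is not shown; it would normally be handled by a dedicated morphological analyzer.

```python
from itertools import groupby


def script_of(ch: str) -> str:
    """Classify one character by Unicode block (simplified assumption)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:          # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:          # Katakana block (includes the long-vowel mark)
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF:
        return "kanji"                    # CJK Unified Ideographs and Extension A
    return "other"                        # Latin letters, digits, punctuation, ...


def script_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters that share a script into (script, chunk) pairs."""
    return [(script, "".join(chars)) for script, chars in groupby(text, key=script_of)]


# Illustrative sentence, not taken from any client document.
for script, chunk in script_runs("銀行のデータを要約する"):
    print(script, chunk)
```

Runs of kanji and katakana often mark content words, which is why a split like this can serve as a cheap first pass before fuller linguistic analysis.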

Solutions

First, we extracted keyphrases from the source document. Then, we labeled those keyphrases and trained a binary machine learning classifier to perform the summarization. Finally, in the testing phase, we ran the classifier on the extracted keyphrases and sentences.
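
The steps above can be sketched as a small extractive pipeline. This is a minimal illustration under several assumptions, not the project's actual implementation: keyphrases are approximated as frequent runs of kanji or katakana, each sentence is described by two hand-picked features, and a tiny logistic regression stands in for the binary classifier. All names, features, and the toy training data are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: candidate keyphrases are runs of two or more kanji/katakana characters.
CANDIDATE = re.compile(r"[\u30A0-\u30FF\u4E00-\u9FFF]{2,}")


def split_sentences(text):
    """Split on the Japanese full stop and keep it on each sentence."""
    return [s.strip() + "。" for s in text.split("。") if s.strip()]


def extract_keyphrases(text, top_k=5):
    """Return the top_k most frequent candidate phrases."""
    return [p for p, _ in Counter(CANDIDATE.findall(text)).most_common(top_k)]


def features(sentence, index, total, keyphrases):
    """Feature vector: bias, number of keyphrases present, position (1.0 = first)."""
    hits = sum(1 for k in keyphrases if k in sentence)
    position = 1.0 - index / max(total - 1, 1)
    return [1.0, float(hits), position]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=500, lr=0.1):
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w


def summarize(text, weights, n=3):
    """Score every sentence and return the n best, in document order."""
    sentences = split_sentences(text)
    keyphrases = extract_keyphrases(text)
    scored = []
    for i, s in enumerate(sentences):
        x = features(s, i, len(sentences), keyphrases)
        scored.append((sigmoid(sum(wi * xi for wi, xi in zip(weights, x))), i, s))
    best = sorted(scored, reverse=True)[:n]
    return [s for _, _, s in sorted(best, key=lambda t: t[1])]


# Toy labeled feature vectors (1 = belongs in the summary); purely illustrative.
TRAIN_X = [[1, 3, 1.0], [1, 2, 0.6], [1, 0, 0.9], [1, 0, 0.2], [1, 1, 0.0]]
TRAIN_Y = [1, 1, 0, 0, 0]
weights = train(TRAIN_X, TRAIN_Y)

# Made-up sample text, not a real banking document.
DOC = ("当行は新しい融資制度を開始します。融資制度の対象は中小企業です。"
       "本日は晴れています。申込は来月から受け付けます。"
       "融資制度の金利は年1パーセントです。")
summary = summarize(DOC, weights)
print("\n".join(summary))
```

Returning the selected sentences in document order keeps the 3-sentence summary readable. A production system would replace the hand-picked features and toy labels with real labeled data.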

The client reviews the AI for usefulness and usability every week and gives us feedback to improve the model. We are deploying this AI to summarize banking documents. After months of development, its output has become steadily more precise.

Impacts

Developing an AI can be more time-consuming than developing a typical application, but the impact it can bring to businesses is tremendous. With this AI, businesses can save hours of reading while still ensuring people get the key points. Besides banks, we look forward to bringing this AI to any business that wants to summarize its documents.