Text Segmentation Using Named Entity Recognition and Co-Reference Resolution in Greek Texts
Abstract
In this paper we examine the benefit of
performing named entity recognition and co-reference
resolution on a Greek corpus used for text segmentation.
Each segment is a portion of one of the 300
documents published by ten different authors in the
Greek newspaper "To Vima". The aim here is to
examine whether the combination of text segmentation
and information extraction (more specifically, the
named entity recognition and co-reference resolution
steps) benefits the identification of
the various topics that appear in a document. Named
entity recognition was performed using an
existing tool that was trained on a similar corpus. The
produced annotations were manually corrected and
enriched in order to cover four types of named entities
(i.e., person name, organization, location and time). Co-reference
resolution, more specifically the substitution
of every reference to the same instance with the same
named entity identifier, was performed in a subsequent
step. The evaluation using three well-known text
segmentation algorithms leads to the conclusion that
the benefit depends strongly on the segment's topic, the
number of named entity instances appearing in it, as
well as the segment's length.
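
The substitution step described above can be illustrated with a minimal sketch. It assumes that mentions have already been annotated with character offsets and grouped by entity. The Mention class, the identifier format (e.g. "PERSON_1"), and the English example sentence are hypothetical and serve only as illustration; they are not the tool or the data used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int       # character offset where the mention begins
    end: int         # character offset just past the mention
    entity_id: str   # hypothetical identifier format, e.g. "PERSON_1"


def substitute_mentions(text: str, mentions: list[Mention]) -> str:
    """Replace every annotated mention with its named entity identifier."""
    pieces = []
    cursor = 0
    for m in sorted(mentions, key=lambda m: m.start):
        if m.start < cursor:
            raise ValueError("overlapping mentions are not supported")
        pieces.append(text[cursor:m.start])  # text between mentions is kept
        pieces.append(m.entity_id)           # the mention itself is replaced
        cursor = m.end
    pieces.append(text[cursor:])
    return "".join(pieces)


def find(text: str, surface: str, entity_id: str) -> Mention:
    """Helper for the example: locate the first occurrence of a surface form."""
    start = text.index(surface)
    return Mention(start, start + len(surface), entity_id)


# Illustrative sentence (assumed, not taken from the corpus).
text = "Maria Pappas visited Athens. She met the mayor there."
mentions = [
    find(text, "Maria Pappas", "PERSON_1"),
    find(text, "Athens", "LOCATION_1"),
    find(text, "She", "PERSON_1"),       # co-referent with "Maria Pappas"
    find(text, "there", "LOCATION_1"),   # co-referent with "Athens"
]

print(substitute_mentions(text, mentions))
# PERSON_1 visited LOCATION_1. PERSON_1 met the mayor LOCATION_1.
```

After substitution, repeated references to the same instance share one token, so a segmentation algorithm that relies on word repetition sees them as the same term.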