A Novel Web Archiving Approach based on Visual Pages Analysis
Ben Saad, Myriam
Due to the growing importance of the World Wide Web, archiving the web has become a cultural necessity for preserving knowledge. To keep a web archive up to date, crawlers harvest the web by iteratively downloading new versions of documents. However, crawlers frequently retrieve pages whose only changes are unimportant, such as continually updated advertisements. As a result, web archive systems waste time and space indexing and storing useless page versions. In this paper, we present a novel approach that detects important changes between versions in order to archive the web efficiently. Our approach combines visual page segmentation with the concept of importance when detecting changes between versions. It consists of archiving the visual layout structure of a web page, represented by semantic blocks. We propose a change detection algorithm that computes the differences between the visual layout structures of documents. We also describe a method to evaluate the importance of detected changes. Tests were conducted to evaluate the feasibility of our approach, and the experimental results show promising performance.
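The idea of comparing block-based page versions and weighting the detected changes by importance can be illustrated with a minimal Python sketch. It is not the algorithm proposed in the paper: the `Block` record, its `block_id` and `importance` fields, the use of `difflib.SequenceMatcher` as a text-similarity measure, and the `threshold` value are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Block:
    block_id: str      # hypothetical stable identifier of a semantic block
    text: str          # textual content of the block
    importance: float  # hypothetical weight in [0, 1]; an ad block would be near 0


def detect_changes(old, new):
    """Return {block_id: (kind, amount)} for blocks that differ between versions."""
    old_by_id = {b.block_id: b for b in old}
    new_by_id = {b.block_id: b for b in new}
    changes = {}
    for bid in old_by_id.keys() | new_by_id.keys():
        before, after = old_by_id.get(bid), new_by_id.get(bid)
        if before is None:
            changes[bid] = ("inserted", 1.0)
        elif after is None:
            changes[bid] = ("deleted", 1.0)
        elif before.text != after.text:
            # Assumed measure: 1 - similarity ratio of the two block texts.
            similarity = SequenceMatcher(None, before.text, after.text).ratio()
            changes[bid] = ("updated", 1.0 - similarity)
    return changes


def change_importance(old, new):
    """Importance-weighted amount of change, normalised to [0, 1]."""
    weights = {b.block_id: b.importance for b in old}
    weights.update({b.block_id: b.importance for b in new})
    total = sum(weights.values())
    if total == 0:
        return 0.0
    changes = detect_changes(old, new)
    changed = sum(weights[bid] * amount for bid, (_, amount) in changes.items())
    return changed / total


def should_archive(old, new, threshold=0.1):
    """Keep the new version only if its weighted change reaches the threshold."""
    return change_importance(old, new) >= threshold
```

Under these assumptions, a new version in which only a zero-weight advertisement block changed scores 0 and would be skipped, whereas a version that adds or rewrites a highly weighted block would be archived.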