A Novel Web Archiving Approach based on Visual Pages Analysis
Date
2009
Author
Ben Saad, Myriam
Pehlivan, Zeynep
Gançarski, Stéphane
Abstract
Due to the growing importance of the World Wide Web,
archiving the web has become a cultural necessity in preserving
knowledge. To keep a web archive up to date,
crawlers harvest the web by iteratively downloading new versions
of documents. However, crawlers frequently
retrieve pages with unimportant changes, such as advertisements
that are continually updated. Hence, web archive
systems waste time and space indexing and storing useless
page versions. In this paper, we present a novel approach
that detects important changes between versions in order to
efficiently archive the web. Our approach combines the concept
of visual page segmentation with the concept of
importance while detecting changes between versions. The
approach consists of archiving the visual layout structure
of a web page represented by semantic blocks. We propose
an appropriate change detection algorithm to compute differences
between these visual layout structures of documents.
We also describe a method to evaluate the importance of detected
changes. Tests were conducted to evaluate the feasibility
of our approach. Experimental results show promising
performance.
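
The idea of weighting detected changes by block importance can be illustrated with a minimal sketch. This is not the authors' algorithm: the `Block` record, the per-block `importance` weights, the identifier-based block matching, and the `threshold` value are all assumptions made for illustration, and the paper's own segmentation and change detection are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A semantic block of a page (hypothetical representation)."""
    block_id: str      # assumed stable identifier of the block in the layout
    text: str          # textual content of the block
    importance: float  # assumed weight in [0, 1]; an ad block would be near 0


def changed_importance(old: list[Block], new: list[Block]) -> float:
    """Return the share of total importance carried by changed blocks."""
    old_by_id = {b.block_id: b for b in old}
    new_by_id = {b.block_id: b for b in new}
    total = 0.0
    changed = 0.0
    for block_id in old_by_id.keys() | new_by_id.keys():
        before = old_by_id.get(block_id)
        after = new_by_id.get(block_id)
        # An inserted or deleted block takes its weight from the version that has it.
        weight = max(b.importance for b in (before, after) if b is not None)
        total += weight
        if before is None or after is None or before.text != after.text:
            changed += weight
    return changed / total if total else 0.0


def should_archive(old: list[Block], new: list[Block], threshold: float = 0.2) -> bool:
    """Archive the new version only if the weighted change reaches the threshold."""
    return changed_importance(old, new) >= threshold


v1 = [Block("article", "Old story", 0.8), Block("ad", "Buy A", 0.1)]
v2_ad_only = [Block("article", "Old story", 0.8), Block("ad", "Buy B", 0.1)]
v2_article = [Block("article", "New story", 0.8), Block("ad", "Buy A", 0.1)]
```

Under these assumed weights, a version that differs only in the advertisement block falls below the threshold and would be skipped, while a version whose article block changed would be archived.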