Migrating Content in WARC Files
Beran, Peter Paul
Heritage institutions all over the world have started harvesting and preserving resources of the World Wide Web for future generations as part of our cultural heritage. This task is non-trivial because of two challenges: (1) crawling the enormous amount of data on the Internet and (2) applying long-term preservation strategies to these data. Considerable effort is being put into the development of Web crawlers, and there are many years of experience with bit-level storage of large amounts of data. However, support for the logical preservation of Internet archives is very limited. The continuous development of Web technologies, and especially the rapid change among a tremendous variety of file formats, puts the digital assets in Web archives at risk of becoming inaccessible and unusable in the near future. This paper presents a workflow for applying digital preservation strategies to the content of WARC archives. Migrating the objects within a WARC archive keeps the information accessible and usable in the future. The new WARC format, which is widely used to store Internet crawl results, supports migration of its content. Moreover, a set of tools is presented that supports the extraction, migration and injection of objects in WARC files.
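The extract, migrate and inject steps named in the abstract can be illustrated with a minimal Python sketch. It is not the tool set presented in the paper. It assumes an uncompressed WARC stream, ignores header continuation lines, and matches the Content-Type header by exact string comparison. The function names (read_records, write_record, migrate_warc) and the migrate callback are hypothetical. The one element taken from the WARC format itself is the "conversion" record type, which points back to the original record through the WARC-Refers-To field.

```python
import io
import uuid
from datetime import datetime, timezone


def read_records(stream):
    """Extraction: yield (headers, payload) for each record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def write_record(stream, headers, payload):
    """Injection: serialize one record, recomputing Content-Length."""
    headers = {**headers, "Content-Length": str(len(payload))}
    stream.write(b"WARC/1.0\r\n")
    for name, value in headers.items():
        stream.write(f"{name}: {value}\r\n".encode("utf-8"))
    stream.write(b"\r\n" + payload + b"\r\n\r\n")


def migrate_warc(src, dst, source_type, target_type, migrate):
    """Copy every record; add a 'conversion' record for each matching one."""
    for headers, payload in read_records(src):
        write_record(dst, headers, payload)  # the original is always kept
        if headers.get("Content-Type") != source_type:
            continue
        now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        conversion = {
            "WARC-Type": "conversion",
            "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
            "WARC-Date": now,
            "WARC-Refers-To": headers["WARC-Record-ID"],
            "Content-Type": target_type,
        }
        if "WARC-Target-URI" in headers:
            conversion["WARC-Target-URI"] = headers["WARC-Target-URI"]
        write_record(dst, conversion, migrate(payload))
```

A caller supplies the migration itself, for example a function that re-encodes Latin-1 text as UTF-8 or one that wraps an external format converter. Keeping the original record next to its conversion record means the migration can be audited or repeated later with a better converter.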