Overview

What is the web archiving program?

The University of Melbourne's Web Archiving Program captures and preserves over 300 University websites using the Internet Archive’s Archive-It service.

The program is administered by Records & Information.

How did the program come about?

The program was initiated by Records & Information and began life as the Web Archiving Working Group in 2002, which recommended the establishment of the Web Archiving Strategy Project (WASP).

WASP commenced in mid-2003, ran until 2007, and had three major phases:

  1. Research and development of a Web Archiving Policy
  2. Pilot of software solutions (such as PageVault, TRIM and Pandas)
  3. Implementation - this was postponed for some time. In 2007 a business case was developed for a Technical Web Archiving Solution, which led to the purchase of a subscription to the Archive-It web application from the Internet Archive.

The Web Archiving Program in its current form officially began in January 2008.

How is content selected?

Content is selected according to the program's Collection Management Plan, which seeks to capture publicly available web pages that either contain university records or document university activities.

Two strategies are involved in selecting websites for archiving:

  • Whole of domain approach
  • Selective approach.

Domain captures

This strategy involves the capture of all websites on the University's domain and hundreds of unimelb subdomains on a quarterly basis. Captures provide a record of the University's web presence as at 1 January, 1 April, 1 July and 1 October each calendar year.

This approach provides a record of what was on a University website at the time of the capture; it does not provide specific evidence of how many times a page may have been updated between quarterly captures.

Selective captures

The selective captures strategy covers websites identified as requiring more frequent capture, or websites that are decommissioned between the quarterly captures.

This approach is used to identify websites or web pages that may need to be captured from both a records management and a risk management perspective. For example, the University's home page is captured on a daily basis: it changes almost daily, capturing it helps manage any reputational risk the page may pose to the University, and, over time, the captures provide an historical record of the University's changing public face.

Web pages and websites are also captured because they contain University web records, as identified by the Enterprise Classification Scheme (ECS). The ECS provides a guide to identifying University web records and acts as a cross-check to help ensure that the URLs representing core University functions and activities are being captured.

How is content captured?

Collection groups

So that websites can be captured quickly and efficiently, they are grouped into collections according to whether they belong to a faculty, department, school, centre or institute, or to an administrative function such as Student Administration, Student Services, External Relations or Information Management.

To find a website, you do not need to know which collection it belongs to; simply enter its URL to find out whether it has been archived.

Capture timings and quality assurance

Captures (or crawls) are set to run for different time periods to ensure that a crawl has time to capture all the pages on all the websites within a collection on a regular basis.

After a crawl has been completed, a quality assurance audit is conducted on each seed that has been archived to check that the way it appears in the archive is the way it appears live on the web.

What technology is used?

Tools

The program uses the Archive-It web harvesting solution, which systematically retrieves each page on a specified domain and saves a copy. Content is collected and stored according to international standards for digital preservation and access.

The Archive-It service uses a number of open source components including:

  • Heritrix web crawler to collect web content
  • NutchWAX indexing engine to provide search services
  • Wayback to provide the user interfaces.
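Conceptually, a crawler like Heritrix repeats a simple fetch-parse-enqueue loop over each seed's scope until no new in-scope pages remain. The sketch below illustrates that loop in Python; it is not Heritrix itself, and the seed URL and in-memory "site" are invented purely for demonstration:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# A tiny in-memory "web" standing in for live HTTP fetches (illustrative only).
FAKE_SITE = {
    "https://example.edu/": '<a href="/about">About</a> <a href="/news">News</a>',
    "https://example.edu/about": '<a href="/">Home</a>',
    "https://example.edu/news": '<a href="/about">About</a>',
}

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, scope="https://example.edu/"):
    """Breadth-first crawl: fetch each page once, follow in-scope links."""
    frontier, archived = deque([seed]), {}
    while frontier:
        url = frontier.popleft()
        if url in archived:
            continue                      # already captured this page
        page = FAKE_SITE.get(url)
        if page is None:
            continue                      # unreachable page
        archived[url] = page              # a real crawler writes a WARC record here
        parser = LinkExtractor()
        parser.feed(page)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith(scope):  # stay within the seed's scope
                frontier.append(absolute)
    return archived

pages = crawl("https://example.edu/")
print(sorted(pages))  # every in-scope page, each captured exactly once
```

The "visited" check and the scope test are the two controls that keep a crawl both finite and confined to the collection being archived.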

Volume limitations

Although Archive-It is extremely flexible, there are limits on the number of pages the University can capture under the terms of its service agreement. If you have a page that requires a one-off snapshot, or that should be captured more frequently, please fill in this form.

Is University web content archived by other organisations?

Yes. A number of external agencies collect publicly accessible web content for a variety of purposes, including:

  • Internet Archive - use the URL of interest as a search term, eg, http://web.archive.org/web/*/www.unimelb.edu.au. Alternatively, use the Wayback Machine to locate content of interest.
  • Google Cache - use the query string cache:www.unimelb.edu.au (replace www.unimelb.edu.au with a URL of your own choosing).
  • National Library of Australia's PANDORA web archive - use the National Library's Trove discovery service to search for content.
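The Internet Archive also exposes a public JSON availability API that reports the closest archived snapshot of a given URL. The sketch below only builds such query URLs (the network call itself is left out so the example stays self-contained); the endpoint is the Archive's published one, and the timestamp value is an arbitrary example:

```python
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build a query against the Internet Archive's availability API.
    `timestamp` is an optional YYYYMMDD string; the API returns the
    snapshot closest to it."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urlencode(params)

# Closest snapshot of the University home page to 1 January 2008:
print(availability_query("www.unimelb.edu.au", "20080101"))
```

Fetching the resulting URL returns JSON whose `archived_snapshots.closest` entry, when present, carries the snapshot's Wayback URL and timestamp.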

Note: The University of Melbourne exercises little if any control over the behaviour of these organisations, and is not responsible for their information management policies and procedures, or the availability of their services.

Recognition

In 2009, the University of Melbourne Web Archiving Program was awarded a Certificate of Commendation in the large government agency category of the Sir Rupert Hamer Records Management Awards.

The awards are administered by Public Record Office Victoria in recognition of recordkeeping excellence and innovation in the Victorian public sector.