Web archiving program
What is the web archiving program?
Since 2008, the University of Melbourne's Web Archiving Program has captured and preserved over 800 University websites using the Internet Archive’s Archive-It service.
The program was administered by Records & Information until 2024 and is now administered by the University of Melbourne Archives (UMA).
How do I search for archived content?
Archived content can be found on the Internet Archive's website using its Wayback Machine tool: enter the URL you wish to find and press Enter.
You can also browse and search via the University of Melbourne collections in the Archive-It portal.
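For programmatic lookups, the Internet Archive also publishes a public "availability" API that returns the closest archived snapshot of a URL. The Python sketch below queries that endpoint; the API belongs to the Internet Archive rather than the University service, and the example URL is illustrative only.

```python
import json
import urllib.parse
import urllib.request

# Ask the Internet Archive's public "availability" API for the closest
# archived snapshot of a URL. The endpoint and response shape are the
# Internet Archive's; the example URL below is illustrative.
target = "https://www.unimelb.edu.au/"
api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(target, safe="")

with urllib.request.urlopen(api) as resp:
    data = json.load(resp)

closest = data.get("archived_snapshots", {}).get("closest")
if closest:
    print("Archived copy:", closest["url"], "captured at", closest["timestamp"])
else:
    print("No archived copy found for", target)
```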
How is content selected for archiving?
In the past, any University domain could be requested for capture. From 2023, only the following types of web content are added for capture:
- Content which is publicly available (not internal sites which require Single Sign-On)
- Content classed as 'Permanent' in the University Records Retention and Disposal Authority (RDA)
Note: Existing domains will continue to be captured. Duplicate domains will be deactivated when identified and where appropriate.
How do I request content be added for archiving?
If the content meets the criteria outlined above, then it can be captured and added to the University's Web Archive.
Send an email to UMA at um-archives@unimelb.edu.au, including the following information:
- URL
- Applicable permanent RDA class
- Capture frequency (once-off or quarterly)
UMA will then check that the content is not already being archived. If it is not, they will undertake a test crawl, which will be sent to the requester for validation. Once validated, the crawl will be set up within the Archive-It system.
How is content captured?
Collection groups
Domains are grouped into collections based on whether the website belongs to a faculty, department, school, centre or institute, or to an administrative function such as Student Administration, Student Services, External Relations or Information Management.
You do not need to know which collection a website belongs to in order to find it; simply enter the URL to see whether it has been archived.
Capture timings and quality assurance
Captures (or crawls) are scheduled to run for different lengths of time so that each crawl can capture all the pages on all the websites within a collection on a regular basis.
What technology is used?
Tools
The program uses the Archive-It web harvesting solution, which systematically retrieves each page on a specified domain and saves a copy. Content is collected and stored according to international standards for digital preservation and access.
The Archive-It service uses a number of open-source components, including:
- the Heritrix web crawler to collect web content
- the NutchWAX indexing engine to provide search services
- the Wayback software to provide the user interface.
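Crawlers such as Heritrix write their captures as WARC files, the ISO 28500 container format that underpins the international standards mentioned above. As a minimal sketch, the following Python code inspects a WARC file using the open-source warcio library; warcio and the file name example.warc.gz are illustrative assumptions, not part of the University's tooling.

```python
# A minimal sketch of inspecting a WARC file (ISO 28500), the container
# format that crawlers such as Heritrix write. Assumes the open-source
# warcio library (pip install warcio) and a local file example.warc.gz,
# both illustrative rather than part of the University's setup.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # "response" records hold the captured HTTP responses.
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"),
                  record.http_headers.get_statuscode())
```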
Limitations
Archive-It works best when capturing static pages of text. Although the tool is flexible, it has some limitations, and the following types of content cannot be captured:
- Pages requiring single sign-on (SSO) access (such as intranet or Research Gateway content)
- Some media, such as video and images
- Pages with dynamic content such as databases and directories with search features.
Is University web content archived by other organisations?
Yes. A number of external agencies collect publicly accessible web content for a variety of purposes, including:
- Internet Archive (search via Wayback Machine)
- Google Cache (search via Google and enter cache:[URL])
- National Library of Australia (search via Trove)
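To enumerate the individual captures the Internet Archive holds for a given University URL, its public CDX server API can be queried directly. The sketch below is illustrative only: the endpoint and field names are the Internet Archive's, and the target URL and result limit are example values.

```python
import json
import urllib.parse
import urllib.request

# List recent captures of a URL held by the Internet Archive via its
# public CDX server API. The endpoint and field names are the Internet
# Archive's; the target URL and limit are example values.
target = "www.unimelb.edu.au"
api = ("https://web.archive.org/cdx/search/cdx?"
       + urllib.parse.urlencode({"url": target, "output": "json", "limit": "5"}))

with urllib.request.urlopen(api) as resp:
    rows = json.load(resp)

# The first row is the header; each following row is one capture.
if rows:
    header, captures = rows[0], rows[1:]
    for row in captures:
        record = dict(zip(header, row))
        print(record["timestamp"], record["statuscode"], record["original"])
else:
    print("No captures found for", target)
```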
Note: The University of Melbourne exercises little, if any, control over the behaviour of these organisations and is not responsible for their information management policies and procedures, or for the availability of their services.
Further information
Please direct enquiries regarding the Web Archiving Program to University of Melbourne Archives via um-archives@unimelb.edu.au.