Sovereign Hill's photographic archive holds more than 40,000 digital image files. Staff there, along with counterparts at the Art Gallery of Ballarat on Lydiard Street North, have known for years that a significant portion of those files are duplicates — the same shot saved twice, three times, sometimes more, under different filenames after successive scanning drives and system migrations. What nobody had properly counted, until recently, was how many.
The problem is not unique to Ballarat, but it lands with particular weight here. The city's institutions carry an outsized responsibility for preserving gold rush–era photography, colonial records and heritage documentation held nowhere else. When those archives are cluttered with duplicate files, staff waste hours searching, storage budgets blow out, and the risk of accidentally deleting the only good copy of an irreplaceable image rises sharply.
What the Data Actually Shows
Research published in 2024 by the Digital Preservation Coalition found that duplicate image files typically account for between 18 and 34 percent of total digital storage in cultural heritage collections that have undergone multiple migration cycles without a formal deduplication policy. Apply the lower end of that range to a collection of 40,000 files and you are looking at roughly 7,200 redundant images consuming server space, slowing catalogue searches and muddying provenance records.
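As a quick check on that range, the estimate can be reproduced in a few lines. This is only a sketch: the function name is illustrative, and the 18 and 34 percent shares are simply the figures quoted above applied to a 40,000-file collection.

```python
def estimate_duplicates(total_files: int,
                        low_share: float = 0.18,
                        high_share: float = 0.34) -> tuple[int, int]:
    """Return the (low, high) number of duplicate files implied by the quoted shares."""
    return round(total_files * low_share), round(total_files * high_share)


low, high = estimate_duplicates(40_000)
print(f"Estimated duplicates: {low:,} to {high:,}")
```

At the upper end of the range, the same collection would hold about 13,600 redundant files.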
Storage costs in institutional environments, factoring in redundancy, backup and security compliance, run at roughly $80 to $120 per terabyte per year in Australian cloud infrastructure contracts, according to industry pricing published by the Australian Government's Digital Transformation Agency. High-resolution heritage scans average between 50 and 150 megabytes per file. A bank of 7,000 duplicate files at that size occupies roughly 0.35 to 1.05 terabytes, meaning a mid-sized regional institution may be spending between about $28 and $126 per year purely on storing images it already has.
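The step from file size to annual cost is plain arithmetic and can be sketched as follows. The function name is illustrative, and the sketch assumes decimal units (1 TB = 1,000,000 MB), treats each duplicate as one stored file, and takes the quoted per-terabyte price as already covering backup and redundancy.

```python
MB_PER_TB = 1_000_000  # assumption: decimal units, 1 TB = 1,000,000 MB


def annual_storage_cost(file_count: int, mb_per_file: float, dollars_per_tb_year: float):
    """Return (terabytes occupied, dollars per year) for a set of files."""
    terabytes = file_count * mb_per_file / MB_PER_TB
    return terabytes, terabytes * dollars_per_tb_year


# Low case: small files at the cheaper rate. High case: large files at the dearer rate.
low_tb, low_cost = annual_storage_cost(7_000, 50, 80)
high_tb, high_cost = annual_storage_cost(7_000, 150, 120)
print(f"{low_tb:.2f} to {high_tb:.2f} TB, ${low_cost:.0f} to ${high_cost:.0f} per year")
```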
The dollar figure sounds modest. Multiply it across multi-decade contracts and combine it with staff time — archivists at Ballarat's institutions earn between $65,000 and $85,000 annually under Victorian public sector classifications — and the waste compounds quickly. An archivist spending four hours per week navigating duplicate files is losing roughly 200 hours of productive cataloguing time per year.
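To put a rough dollar value on that lost time, the sketch below converts the quoted salaries into an hourly rate. The 52-week year and 38-hour paid week are assumptions made for illustration, not figures from any institution, and the calculation covers salary only, ignoring leave and on-costs.

```python
def duplicate_time_cost(hours_per_week: float, annual_salary: float,
                        weeks_per_year: int = 52, paid_hours_per_week: float = 38):
    """Return (hours lost per year, salary value of those hours)."""
    hours_lost = hours_per_week * weeks_per_year
    hourly_rate = annual_salary / (weeks_per_year * paid_hours_per_week)
    return hours_lost, hours_lost * hourly_rate


for salary in (65_000, 85_000):
    hours, cost = duplicate_time_cost(4, salary)
    print(f"${salary:,} salary: {hours} hours, about ${cost:,.0f} per year")
```

Under those assumptions the staff-time cost comes to several thousand dollars per archivist per year, well above the storage bill.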
The University of Melbourne's cultural collections unit completed a deduplication project across its photographic holdings in late 2024, removing more than 12,000 redundant files from a 90,000-image archive. The process took three months using a combination of perceptual hashing software and manual review, and freed up 1.8 terabytes of storage. That project is now being studied by regional institutions around Victoria as a potential template.
What Ballarat Institutions Are Doing About It
The Art Gallery of Ballarat, which has been undertaking a broader digital infrastructure review since its 2023–24 capital works period, is understood to be assessing deduplication software options as part of its collections management upgrade. The Ballarat Clarendon College Foundation separately manages photographic records going back to the 1880s and faces the same structural issue — files ingested from donor collections rarely come with deduplication checks applied.
Sovereign Hill, whose archive underpins its interpretive programs on the Sturt Street end of the visitor precinct, has a particular challenge: many of its images exist in both high-resolution master form and in lower-resolution copies produced for web and print use. These near-duplicates are harder to catch with automated tools because they are not pixel-identical, and they require human review to assess which version should be retained as the archival standard.
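Perceptual hashing, the technique used in the University of Melbourne project, is one way to shortlist such near-duplicates for a human reviewer. The sketch below implements a simple average hash in plain Python on synthetic grayscale grids. It is an illustration of the general idea, not the method used by any institution or tool named here; the function names, the test images and the 10-bit threshold are all assumptions.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit average hash of a grayscale image.

    pixels is a list of rows of brightness values. Width and height are
    assumed to be multiples of size to keep the resampling simple.
    """
    block_h, block_w = len(pixels) // size, len(pixels[0]) // size
    small = []
    # Shrink the image to size x size by averaging each block of pixels.
    for by in range(size):
        for bx in range(size):
            block = [pixels[y][x]
                     for y in range(by * block_h, (by + 1) * block_h)
                     for x in range(bx * block_w, (bx + 1) * block_w)]
            small.append(sum(block) / len(block))
    mean = sum(small) / len(small)
    bits = 0
    # One bit per block: 1 if the block is brighter than the overall mean.
    for value in small:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def downscale_by_two(pixels):
    """Halve width and height by averaging 2x2 blocks (stand-in for a web copy)."""
    return [[round((pixels[y][x] + pixels[y][x + 1]
                    + pixels[y + 1][x] + pixels[y + 1][x + 1]) / 4)
             for x in range(0, len(pixels[0]), 2)]
            for y in range(0, len(pixels), 2)]


# Synthetic 64x64 "master": dark left half, bright right half, light texture.
master = [[40 + (x + y) % 7 if x < 32 else 200 + (x * y) % 11
           for x in range(64)] for y in range(64)]
web_copy = downscale_by_two(master)  # 32x32, not pixel-identical to the master
unrelated = [[220 if y < 32 else 30 for x in range(64)] for y in range(64)]

THRESHOLD = 10  # hypothetical cutoff; a real project would tune this on its own images
for name, image in (("web copy", web_copy), ("unrelated", unrelated)):
    distance = hamming_distance(average_hash(master), average_hash(image))
    verdict = "flag for review" if distance <= THRESHOLD else "treat as distinct"
    print(f"{name}: {distance} bits differ, {verdict}")
```

A small distance only suggests that two files show the same image. Deciding which version becomes the archival master remains a human judgement.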
The practical path forward for Ballarat's institutions runs through a combination of open-source tools (ExifTool and dupeGuru are both free and widely used in the heritage sector) and a minimum viable policy requiring that any new image ingested into a collection be checked against existing holdings before it is saved. The State Library of Victoria distributes guidance on exactly this process through its Regional Library Corporation program, and Ballarat Community Library is already a member of that network.
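A check-before-ingest rule of that kind can be sketched with nothing more than Python's standard library. The example below indexes existing holdings by SHA-256 content hash and looks up each incoming file before it is saved. It is a minimal illustration, not the workflow of ExifTool, dupeGuru or the State Library guidance; the function names, folder names and file contents are hypothetical, and it catches byte-identical copies only, not resized or re-encoded versions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's bytes, read in chunks so large scans need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(holdings_dir: Path) -> dict[str, Path]:
    """Map each content digest to the first file found with that content."""
    index: dict[str, Path] = {}
    for path in sorted(holdings_dir.rglob("*")):
        if path.is_file():
            index.setdefault(file_digest(path), path)
    return index


def check_before_ingest(candidate: Path, index: dict[str, Path]):
    """Return the existing file with identical bytes, or None if the candidate is new."""
    return index.get(file_digest(candidate))


def demo():
    """Run the check against a throwaway folder structure with made-up files."""
    with tempfile.TemporaryDirectory() as tmp:
        holdings = Path(tmp) / "holdings"
        incoming = Path(tmp) / "incoming"
        holdings.mkdir()
        incoming.mkdir()
        (holdings / "scan_0001.tif").write_bytes(b"image-bytes-A")
        (incoming / "donor_copy.tif").write_bytes(b"image-bytes-A")  # same content, new name
        (incoming / "new_plate.tif").write_bytes(b"image-bytes-B")   # genuinely new
        index = build_index(holdings)
        match = check_before_ingest(incoming / "donor_copy.tif", index)
        fresh = check_before_ingest(incoming / "new_plate.tif", index)
        return (match.name if match else None), fresh


print(demo())
```

Because the lookup keys on file content rather than filename, a renamed copy from a donor collection is still matched to the existing scan.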
The window for action is narrower than it looks. As collections grow and file counts increase, retrospective deduplication becomes disproportionately more time-consuming, because every additional file is one more item that may need comparing against the rest of the collection. Institutions that defer the work past their next major system migration will likely face a far larger problem on the other side of it.