Solving A Web Archiving Mystery
Today’s post is authored by NYU Library’s Web Archivist Nicole Greenhouse.
Amy Vo, the Cold War Collections Project Archivist, reached out to me the other day to ask about an archived website that is part of a collection she is working on. Her inquiry led me into a little mystery, one that does a great job of highlighting the history of web archiving at NYU and my current work.
As the Web Archivist in Archival Collections Management, my job is to administer the web archives for the Special Collections. Web archiving is the process of selecting and capturing web-published materials, such as websites, for long-term preservation. NYU has been archiving websites since 2007, as the result of a partnership with the California Digital Library as part of the Web-at-Risk Project.1 2 This project led to the development of the Web Archiving Service (WAS), which the Tamiment Library & Robert F. Wagner Labor Archives used to capture websites until 2015. Starting in 2011, the University Archives and other departments in the NYU Libraries began using Archive-It. As of 2019, NYU Special Collections has captured over 12.5 TB of web data.
Amy was working on the collection WAG 180, the records of Sheet Metal Workers International Association, Local 137. Associated with that collection was this link: http://webarchives.cdlib.org/site/sw1sf2mc4t. When I clicked on the link, it redirected to this page:
In 2015, the Web Archiving Service ceased operations, and Tamiment migrated their websites to Archive-It. So now my job was to start investigating what happened to this website that was meant to be migrated.
Luckily for me, there was a lot of internal documentation on the migration of websites into Archive-It. According to a report from WAS, the local’s website had been captured a total of 25 times (the numbers in parentheses) under two different URLs: https://local137.com/ and http://smart137.com.
Besides the different URLs and the number of crawls, the report also gave me clues about how Tamiment scoped the crawls using WAS. For instance, “preserved” denoted that the website was no longer actively crawled, making it likely that https://local137.com/ was a dead link. However, it appeared that the website was captured on a quarterly basis (four times a year) and that each crawl was supposed to capture the host website (smart137.com) plus any pages linked from the main host.
Now that I had the URLs, I was able to check whether they had been captured in our Archive-It account. The majority of our labor union related websites are in the Labor Unions and Organizations (U.S.) Collection: https://archive-it.org/collections/6349. Through this link, you can browse and search the collection to see what Tamiment has captured from 2007 to today.
However, I just wanted to find out whether we had the website in our collection. To check whether a URL is in the collection, I combine the prefix of the Archive-It calendar page (https://wayback.archive-it.org/), the unique identifier of the collection (6349), and the URL of the potential archived website, with /*/ between the identifier and the URL. So to check whether the original URL for the local was captured, I created this link: https://wayback.archive-it.org/6349/*/https://local137.com/. And we see that there is one capture associated with this URL in the collection:
When you click on the date, you can see what the local’s website looked like in 2007:
I also checked for the URL to http://smart137.com, and as of early June, there were no crawls in the collection associated with that URL.
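The link-building step described above can be sketched in a few lines of Python. This is only an illustration of the pattern visible in the link shown earlier (prefix, collection identifier, /*/, original URL); the function name and constant are hypothetical and are not part of any Archive-It tool.

```python
# Prefix of the Archive-It calendar page, as given in the post.
ARCHIVE_IT_WAYBACK = "https://wayback.archive-it.org/"


def archive_it_calendar_url(collection_id, original_url):
    """Build the calendar-page link for one URL in one collection.

    Hypothetical helper: it only joins the three pieces in the same
    order as the link in the post. It does not contact Archive-It.
    """
    return f"{ARCHIVE_IT_WAYBACK}{collection_id}/*/{original_url}"


# The two URLs listed in the WAS report, checked against collection 6349.
for url in ("https://local137.com/", "http://smart137.com"):
    print(archive_it_calendar_url(6349, url))
```

Opening each printed link in a browser shows the calendar of captures, if any, for that URL in the collection.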
Interestingly, and unfortunately, it seems that most of the crawls captured by WAS did not successfully migrate into Archive-It. However, all is not lost: I began recrawling http://smart137.com/, and it appears that both websites can be found in the Wayback Machine, which includes Archive-It partner crawls as well as World Wide Web crawls run by the Internet Archive itself. You can check out these crawls here: https://web.archive.org/web/20170501000000*/http://smart137.com/ and https://web.archive.org/web/20170501000000*/https://local137.com/.
In addition to solving these mysteries, I am responsible for describing the websites in the context of our analog collections, providing quality assurance on archived websites so they appear as close to the live website as possible, and crawling and generally maintaining the archived websites. To explore the entirety of our public collections, visit NYU’s Archive-It portal. If you have any web archiving questions, feel free to reach out to me. I love talking about this subject!