by Andrew Weaver
This is the fourth post in the bloggERS Script It! Series.
As the Digital Infrastructure and Preservation Librarian at Washington State University Libraries, I am responsible for helping to ensure the integrity and provenance of all digital files maintained in our archives. Recently a lot of my work has involved processing legacy materials in order to generate or expand upon their associated metadata. As much of this work is specific to certain file formats, I often find myself trying to hunt down all examples of a certain file type spread across thousands of directories representing many terabytes worth of data. To do this I have been using the Bash
find command. This allows me to generate lists of file paths which I can then loop through to perform actions as needed, such as file validation and metadata extraction. As I imagine that I am not the only one with vast amounts of legacy files to process, I wanted to write up a quick explanation of this method and hopefully save others some time!
For this example, I will be targeting
.wav files, but this process would work on any type of file. The main caveat to this method is that it involves locating files solely by file extension; this means that any files that have non-conforming extensions will not be discovered, and files that have been erroneously changed to the target extension will show up as false positives. If a higher level of certainty is desired for target files, an intermediate step could be to first run a loop verifying all results through a tool such as Siegfried or MediaConch.
A basic example of the find command is:
find "My-Folder" -iname "*.wav"
This command can be broken down as follows:
find initiates the command.
My-Folder is the file path, either “absolute” or “relative,” to the target directory that will be searched. The “absolute” path is the entire path to your directory from the base of your system and can be quickly found by dragging your directory into your terminal. An example of what this might look like is /home/username/Desktop/myfolder. The “relative” path is a path from where you already are, so for example, if your terminal was open to your Desktop, the relative path to the directory in the preceding example would just be myfolder. The search is recursive, meaning that any sub-folders will also be searched. If . is used instead of a specific directory, then the current directory will be searched. Note that if there are any spaces in your file path, you will have to put it in quotes.
-iname tells the find command that you will be searching for case-insensitive terms in the filenames. This helps account for people mixing case, such as .TIF, over the years.
"*.wav" tells the find command the term that you will be searching for. In this instance, the * is a wildcard, so the results will include any files that end in .wav.
Running this command recursively finds all WAV files (or again, more accurately, all files that have the
.wav extension) in the target directory and will spit out their file paths into your terminal. By itself this can be informative, but it really becomes useful when paired with a loop to process those results.
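To see the effect of -iname and the recursive search without touching a real collection, the following sketch builds a small throwaway directory and then saves the results to a text file. The folder and file names are invented for illustration, and it assumes a find that supports -iname (the GNU and BSD versions do) and the mktemp utility.

```shell
# Build a throwaway directory tree with mixed-case extensions (hypothetical names).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub folder"
touch "$demo_dir/interview.wav" "$demo_dir/sub folder/CONCERT.WAV" "$demo_dir/notes.txt"

# -name is case sensitive, so it only matches the lowercase extension.
find "$demo_dir" -name "*.wav"

# -iname ignores case and, like any find search, descends into sub-folders.
find "$demo_dir" -iname "*.wav"

# Redirect the results into a text file to keep a list of paths for later use.
find "$demo_dir" -iname "*.wav" > "$demo_dir/wav-list.txt"
wc -l < "$demo_dir/wav-list.txt"
```

The first search lists one file and the second lists two, since only -iname picks up CONCERT.WAV in the sub-folder. Saving the list first can be a useful way to review what will be affected before running any action on the files.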
In this example I will show how to use the results of the previous command to perform a sample preservation action in batch: embedding MD5 checksums into the discovered WAV files using BWF Metaedit (available for download from MediaArea).
The entire command is:
find "/Path/To/My-Folder" -iname "*.wav" | while read targetfile ; do bwfmetaedit --MD5-Embed --reject-overwrite "$targetfile"; done
which can be broken down as follows:
find "/Path/To/My-Folder" -iname "*.wav" is the
find command demonstrated previously. This provides the file paths that will be used by the following commands.
| This is called a ‘pipe’ and is what Bash uses to send the output of one command to another. In this case, we are sending the output of the
find command into a loop. (This is the vertical bar symbol on the keyboard, not a capital I).
while read targetfile ; do This is what starts the loop. What this does is “read” each chunk of information that is coming from the pipe, in this case a file path to each matching file, and then assign each one to the variable “targetfile.” The name “targetfile” is an arbitrary name I chose for this example; it could be replaced with anything, such as “x” or “myfile,” as long as you are consistent with using the same variable name in the following steps.
bwfmetaedit --MD5-Embed --reject-overwrite "$targetfile" This is the actual preservation action that will be run on every file discovered by the
find command. The first part
bwfmetaedit --MD5-Embed is the BWF Metaedit command that embeds a checksum into each target WAV file.
--reject-overwrite is a safeguard that tells BWF Metaedit not to overwrite any existing information that might already be in the file. The second part
"$targetfile" makes the command run on the variable that was created in the previous step, with the result that BWF Metaedit will run on all the WAV files the
find command located. It is important not to forget the quotation marks around this variable; otherwise it won’t work on any filenames that have spaces in them. Since this variable represents file paths in the loop, it can be used in combination with any command you would want to run on your files!
; done ends the loop once it has finished reading the output of the find command.
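The loop described above handles spaces well. A possible hardening, offered as a sketch rather than a correction, covers two rarer cases: a plain read treats backslashes in a path as escape characters, and a newline inside a filename would split one path into two. Assuming Bash and a find that supports -print0 (the GNU and BSD versions do), the paths can be separated with null characters instead. The function process_file is a hypothetical stand-in for the bwfmetaedit call so that the sketch runs without BWF Metaedit installed, and the file names are invented test data.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real action; to embed checksums, replace the
# printf line with: bwfmetaedit --MD5-Embed --reject-overwrite "$1"
process_file() {
    printf 'Would process: %s\n' "$1"
}

# Throwaway test data: one name with a backslash, one with a newline in it.
work_dir=$(mktemp -d)
touch "$work_dir/back\\slash.wav" "$work_dir/"$'line\nbreak.WAV'

# -print0 ends each path with a null character, and read -d '' splits on it.
# IFS= preserves whitespace in the path and -r keeps backslashes literal.
find "$work_dir" -iname "*.wav" -print0 |
    while IFS= read -r -d '' targetfile; do
        process_file "$targetfile"
    done > "$work_dir/actions.log"

# Count how many files the loop handled.
grep -c '^Would process:' "$work_dir/actions.log"
```

Running the stand-in first also works as a dry run: the log shows exactly which files would be touched before the real preservation action is swapped in.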
Since all that needs to be done to use this technique in different contexts is to change the file extension in the initial
find command and then change the preservation action that targets the variable in the loop, I find that I am using similar constructions on an almost daily basis! By using
find loops I am able to save a lot of time (and pain) when dealing with legacy data. I hope this example will help others to do the same!
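As one illustration of swapping in a different action, the sketch below uses the same loop to flag files that carry the .wav extension but do not begin with a RIFF/WAVE header, which touches on the false-positive caveat mentioned earlier. It is a lightweight check under stated assumptions, not a substitute for a format identification tool such as Siegfried or MediaConch: it only looks for "RIFF" in the first four bytes and "WAVE" in bytes nine through twelve, so variants such as RF64 would be reported as mismatches. The function name is_riff_wave and the sample files are invented for this example.

```shell
# Succeeds when a file starts with a RIFF header whose form type is WAVE.
is_riff_wave() {
    local magic form
    # Read bytes 1-4 and bytes 9-12; tr strips null bytes a non-WAV file may contain.
    magic=$(dd if="$1" bs=1 count=4 2>/dev/null | tr -d '\0')
    form=$(dd if="$1" bs=1 skip=8 count=4 2>/dev/null | tr -d '\0')
    [ "$magic" = "RIFF" ] && [ "$form" = "WAVE" ]
}

# Throwaway test data: a minimal WAVE-style header and a mislabeled text file.
check_dir=$(mktemp -d)
printf 'RIFF\044\000\000\000WAVEfmt ' > "$check_dir/real.wav"
printf 'just some text\n' > "$check_dir/renamed.wav"

# Same find loop as before, with the header check as the action.
find "$check_dir" -iname "*.wav" | while read targetfile ; do
    if is_riff_wave "$targetfile"; then
        echo "OK: $targetfile"
    else
        echo "Not RIFF/WAVE: $targetfile"
    fi
done > "$check_dir/report.txt"

cat "$check_dir/report.txt"
```

Only the find pattern and the body of the loop change from one task to the next; the surrounding structure stays the same.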
Andrew Weaver is the Digital Infrastructure and Preservation Librarian at Washington State University. He is from Seattle, and graduated with an MLIS from the University of Washington in 2015. Previously he has worked in the UW Libraries and at the CUNY TV Archive as part of the NDSR program.