File Triage

So in reading this week’s post on CLKF, I thought “Oh hey, that’ll be easy. The Linux command will work on the Mac and I’ll be done.” Well, as usual, nothing is ever easy.

In testing Hal’s final command, I created a bunch of files, removed the extensions, and ran file against the folder.

$ file -p *
blue.ole: Zip archive data, at least v2.0 to extract
green.ole: Zip archive data, at least v2.0 to extract
red.ole: CDF V2 Document, Little Endian, Os: MacOS, Version 10.3, Code page: 10000, Template: Normal.dotm, Last Saved By: seth, Revision Number: 1, Name of Creating Application: Microsoft Macintosh Word, Create Time/Date: Tue Jul 27 19:18:00 2010, Last Saved Time/Date: Tue Jul 27 19:18:00 2010, Number of Pages: 1, Number of Words: 0, Number of Characters: 0, Security: 0

Wait, what? Zip archives? We know that one is a PowerPoint file and one is a Word document. Oh right, the new Microsoft Office file formats (.docx, .pptx, etc.) are actually Zip archives.

Well, that kinda puts a damper on the Linux command if file doesn’t think they’re Office documents. What else can we do?

Enter Spotlight. Spotlight is the built-in search on Mac OS X 10.4 and above. Among other things, its big feature is that it indexes the metadata of each file. On the command line, Apple has provided us with some tools to interface with it. The ones we’ll be using today are mdls (metadata list) and mdfind (metadata find).

$ mdls blue.ole
kMDItemContentCreationDate = 2010-07-28 14:28:52 -0400
kMDItemContentModificationDate = 2010-07-28 14:28:52 -0400
kMDItemContentType = "com.microsoft.word.openxml.document"
kMDItemContentTypeTree = (
"com.microsoft.word.openxml.document",
"public.data",
"public.item",
"org.openxmlformats.openxml",
"public.zip-archive",
"com.pkware.zip-archive",
"com.apple.bom-archive",
"public.archive"
)
kMDItemDisplayName = "blue.ole"
kMDItemFSContentChangeDate = 2010-07-28 14:28:52 -0400
kMDItemFSCreationDate = 2010-07-28 14:28:52 -0400
kMDItemFSCreatorCode = "MSWD"
kMDItemFSFinderFlags = 16
kMDItemFSHasCustomIcon = 0
kMDItemFSInvisible = 0
kMDItemFSIsExtensionHidden = 1
kMDItemFSIsStationery = 0
kMDItemFSLabel = 0
kMDItemFSName = "blue.ole"
kMDItemFSNodeCount = 0
kMDItemFSOwnerGroupID = 20
kMDItemFSOwnerUserID = 102
kMDItemFSSize = 22875
kMDItemFSTypeCode = "WXBN"
kMDItemKind = "Microsoft Word document"
kMDItemSupportFileType = (
MDSystemFile
)

Wow, who would have thought all that metadata is stored in that little Word document? Several of those fields clearly say that we’re looking at a Microsoft Word document, so how do we get that data into a format we can use to rename the files? With the file command it’s pretty easy: it lists the filename and then the type, so the output is simple to parse. The output from the metadata listing isn’t so easy, even if we prune it down to just the Kind or Content Type, either of which would give us the information we need.
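For what it’s worth, here is a rough sketch of what that pruning could look like. It assumes your version of mdls supports the -name and -raw options (check the man page on your system), and the function name mdkind is made up for illustration. It prints one "filename: kind" line per file, roughly in the layout that file uses.

```shell
# Hypothetical helper (mdkind is not a real command): print "filename: kind"
# for each argument, loosely mimicking the layout of file's output.
# Assumes mdls accepts -name (pick one attribute) and -raw (value only).
mdkind() {
    for f in "$@"; do
        printf '%s: %s\n' "$f" "$(mdls -name kMDItemKind -raw "$f")"
    done
}
```

Something like `mdkind *` would then be the Spotlight counterpart of `file *`. The same caveat applies as with file: any filename containing a colon makes the output ambiguous to parse.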

Let’s start out by doing a search (mdfind) for the file types we want.

$ mdfind -onlyin "./" "Microsoft Word"
/clkf/e105/blue.ole
/clkf/e105/red.ole

Note: mdfind doesn’t play nice with ALL of the possible metadata attributes that mdls gives you; particularly it seems to have issues with the “FS” attributes (File System). If we needed to, we could get around this by parsing the output from mdls, but for the purposes of this exercise, that isn’t necessary.
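If we did need one of those FS attributes, the mdls-parsing workaround might look something like the sketch below. Both function names are hypothetical, and it assumes mdls prints each quoted attribute as name = "value" on its own line, as in the listing above.

```shell
# Hypothetical helper: pull the kMDItemFSTypeCode value out of mdls-style
# output arriving on stdin. Assumes the form: kMDItemFSTypeCode = "WXBN"
extract_type_code() {
    sed -n 's/^kMDItemFSTypeCode *= *"\(.*\)"$/\1/p'
}

# Hypothetical helper: list regular files in the current directory whose
# type code equals $1. This is the part mdfind would not do for us.
files_with_type_code() {
    for f in *; do
        [ -f "$f" ] || continue
        code=$(mdls "$f" | extract_type_code)
        [ "$code" = "$1" ] && printf '%s\n' "$f"
    done
    return 0
}
```

With the files above, `files_with_type_code WXBN` should list blue.ole, since that is the type code mdls reported for it.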

Great, it works and outputs the paths to the two files we want. But it doesn’t include the file types with the path strings… this could be a problem. Fear not, loops to the rescue!

$ for f in PowerPoint Word Excel; do echo $f; mdfind -onlyin "./" $f; done
PowerPoint
/clkf/e105/green.ole
Word
/clkf/e105/blue.ole
/clkf/e105/red.ole
Excel

Ok, getting better. Now we know what’s a PowerPoint (green.ole), what’s a Word file (blue.ole, red.ole), and what is an Excel file (nothing). But we need a way to associate the correct extension with each loop, and we’d like to do it in such a way that doesn’t repeat everything three times… enter case!

$ for f in PowerPoint Word Excel; do case $f in Word) ext="doc";; Excel) ext="xls";; PowerPoint) ext="ppt";; esac; echo $f - $ext; mdfind -onlyin "./" $f; done
PowerPoint - ppt
/clkf/e105/green.ole
Word - doc
/clkf/e105/blue.ole
/clkf/e105/red.ole
Excel - xls

Better, better… I’m not a fan of ugly awk (sorry Hal), so let’s throw some additional variables into the mix to make it a little cleaner…

$ for f in PowerPoint Word Excel; do case $f in Word) ext="doc";; Excel) ext="xls";; PowerPoint) ext="ppt";; esac; mdfind -onlyin "./" $f | awk -F. -v ext="$ext" '{ system("mv " $1 ".ole " $1 "." ext) }'; done

Huzzah, the files are renamed!
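One caveat: because the awk one-liner splits on dots and hands an unquoted string to mv, it will trip over paths that contain spaces or extra dots. A possible awk-free alternative is sketched below. The names new_name and rename_to are invented for this example, and it assumes one path per line on stdin, which is how mdfind prints its results.

```shell
# Hypothetical helper: swap a trailing .ole on path $1 for extension $2.
# If the path does not end in .ole, the extension is simply appended.
new_name() {
    printf '%s\n' "${1%.ole}.$2"
}

# Hypothetical helper: read one path per line on stdin and rename each
# file to use the extension given in $1. Quoting keeps spaces intact.
rename_to() {
    while IFS= read -r path; do
        mv -- "$path" "$(new_name "$path" "$1")"
    done
}
```

Inside the loop from earlier, that would turn the awk stage into `mdfind -onlyin "./" "$f" | rename_to "$ext"`.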

Now, the Mac really doesn’t care about file extensions, so this writeup is perhaps a bit redundant. But if we wanted to separate the new and old Office files, we could compare the Kind metadata and figure out what correlates to each one, or we could look at the Type Code (kMDItemFSTypeCode) and use it to differentiate between the types of Word document. I’ll leave that as an exercise for the reader.
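As a starting point for that exercise, here is one way the mapping could go if you key off kMDItemContentType instead. This is only a sketch: the function name ext_for_type is made up, and the only content type confirmed by the mdls listing above is com.microsoft.word.openxml.document. Every other pattern is my assumption about what the remaining Office types look like, so verify them with mdls on your own files first.

```shell
# Hypothetical mapping from a kMDItemContentType value ($1) to an extension.
# Only com.microsoft.word.openxml.document appears in the mdls output above;
# all the other patterns are assumptions and should be checked with mdls.
ext_for_type() {
    case $1 in
        *word*openxml*|*wordprocessingml*)      echo docx ;;
        *powerpoint*openxml*|*presentationml*)  echo pptx ;;
        *excel*openxml*|*spreadsheetml*)        echo xlsx ;;
        com.microsoft.word.doc)                 echo doc ;;
        com.microsoft.powerpoint.ppt)           echo ppt ;;
        com.microsoft.excel.xls)                echo xls ;;
        *)                                      return 1 ;;  # unknown type
    esac
}
```

Feeding it the content type of each file (for example from mdls) gives you the extension to rename to, and the non-zero return for unknown types makes it easy to skip anything that isn’t an Office document.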


This was written in response to the Command Line Kung Fu post here. Check it out for CLI management options for installed applications using cmd.exe, Windows PowerShell and Linux/Unix!
