Topic: Parsing out unique keywords

sharonarchives
Parsing out unique keywords
« on: July 02, 2013, 02:31:09 PM »
I have a keyword metadata question.

I have a collection of images that I've keyworded using Photo Mechanic.

I want to extract a list of the unique keywords applied to that collection (so as to do cleanup, etc.).

Initially I just typed in keywords per image into the IPTC info box (separated by semicolons). I later (about midway into project) created a keyword hierarchy that I now draw on to apply keywords more uniformly. I'm interested in extracting a list of all the unique keywords used on the collection in order to do a cleanup on the earlier manually entered keywords.

I did a lot of digging online to figure out how to extract such a list, and found that I could use ExifTool from the command line to get it. I downloaded ExifTool and ran the following command:

Image-ExifTool-9.31/exiftool -r -q -T -Keywords /Users/sharon/Desktop | perl -pe 's/, /\n/g;' | sort -u

For the most part, it works and parses out the keyword strings to generate a unique keyword list! However, a few, maybe 10%, of the image keywords do not parse. Instead, the entire keyword string shows up in my list of unique keywords. I can't figure out why that is. There's nothing consistent about how I applied the keywords in the unparsed strings, nor is there anything consistent about the contents of those strings (no specific keywords or special character, etc.).

Any thoughts on what might be different about the particular images whose keywords will not parse? Anyone run into similar challenges?

Thank you!
Sharon

Bob M
Re: Parsing out unique keywords
« Reply #1 on: July 02, 2013, 03:25:05 PM »
If you separated your keywords with semicolons, shouldn't your perl regular expression be s/; /\n/g rather than s/, /\n/g ?

Kirk Baker

  • Senior Software Engineer
  • Camera Bits Staff
Re: Parsing out unique keywords
« Reply #2 on: July 02, 2013, 03:41:40 PM »
Keyword separators are not in the actual metadata, they're a convention for parsing them so PM can figure out when a new keyword starts.

What does the keyword list look like if you leave the regex out?

Image-ExifTool-9.31/exiftool -r -q -T -Keywords /Users/sharon/Desktop

It may be that some of them are formatted incorrectly or are hierarchical keywords (paths) separated by '|' characters.

-Kirk

sharonarchives
Re: Parsing out unique keywords
« Reply #3 on: July 03, 2013, 08:49:02 AM »
Thanks for the responses.

I ran the command with the regex left out (Image-ExifTool-9.31/exiftool -r -q -T -Keywords /Users/sharon/Desktop) and the list I get back looks like all the keyword strings for all the images in the folder, unparsed and with repeats showing (rather than unique strings). I did notice both comma-separated and semicolon-separated keyword strings, so clearly we've been inconsistent about how we entered keywords.

Which brings me to Bob M's suggestion of running the command with a semicolon in the perl regular expression rather than a comma:
Image-ExifTool-9.31/exiftool -r -q -T -Keywords /Users/sharon/Desktop | perl -pe 's/; /\n/g;' | sort -u

That alone didn't solve it, because now the command parses the semicolon-delimited keyword strings but leaves the comma-separated ones behind.

I'm new to all this. Can someone let me know how to adjust the command so that it parses based on both semicolons and commas?! If not, I can do a replace all in Photo Mechanic to make the delimiters consistent, but as we're talking about thousands of images, that's not ideal.

Thanks so much!
Sharon

Kirk Baker
Re: Parsing out unique keywords
« Reply #4 on: July 03, 2013, 10:49:04 AM »
Sharon,

Quote from: sharonarchives
Can someone let me know how to adjust the command so that it parses based on both semicolons and commas?!

Try changing your command to:

Image-ExifTool-9.31/exiftool -r -q -T -Keywords /Users/sharon/Desktop | perl -pe 's/,; /\n/g;' | sort -u

or if that doesn't work (I'm not a perl guy at all) try this:

Image-ExifTool-9.31/exiftool -r -q -T -Keywords /Users/sharon/Desktop | perl -pe 's/[,; ]+/\n/g;' | sort -u

Now the bigger issue is that I think that at some point you changed your preference from commas to semi-colons and continued to use commas when you entered keywords.  Thus PM interpreted your words separated by commas as a single keyword instead of as separate keywords.  This may have happened in the opposite direction as well (pref: comma, entered: semi-colon).

I suppose you could try to use the Find/Replace panel, searching only the keywords field for commas and replacing them with nothing, and then searching for semi-colons and replacing them with nothing.  That should 'repair' the files, but it will also remove the commas and semi-colons indiscriminately.  If you really intended to have commas or semi-colons between two (or more) words then they'll be removed.

HTH,

-Kirk

sharonarchives
Re: Parsing out unique keywords
« Reply #5 on: July 03, 2013, 11:38:55 AM »
Thanks for the response, Kirk.

The first suggestion you made doesn't work, I just get full keywords strings (comma separated and semicolon separated).

The second suggestion parses everything and brings up a unique list but it parses everything separated by a space as well, so the keyword "Annie Lebowitz" - for instance - comes up as two separate keywords:
Annie
Lebowitz

When you say we changed our preference from commas to semi-colons do you mean that we did so as a setting in PM? And when you say that PM interpreted our words separated by commas/semi-colons as a single keyword instead of separate keywords, do you mean that "Annie Lebowitz" - separated by a space but surrounded by commas/semicolons would therefore be seen as two keywords: Annie and Lebowitz? We have a lot of multi-word keywords, so this will come up a lot.

Why wouldn't it work to do a find and replace to find all commas and replace with semicolons so that everything is separated in the same way, and then run the command line with a perl expression to just parse based on semicolons to pull the unique list?

Would love to understand more, thanks for your help.
Sharon

Kirk Baker
Re: Parsing out unique keywords
« Reply #6 on: July 03, 2013, 12:10:50 PM »
Sharon,

Quote from: sharonarchives
The second suggestion parses everything and brings up a unique list but it parses everything separated by a space as well, so the keyword "Annie Lebowitz" - for instance - comes up as two separate keywords:
Annie
Lebowitz

Then try this:

Image-ExifTool-9.31/exiftool -r -q -T -Keywords /Users/sharon/Desktop | perl -pe 's/[,;]+[ ]*/\n/g;' | sort -u

Quote from: sharonarchives
When you say we changed our preference from commas to semi-colons do you mean that we did so as a setting in PM?

That is my hunch, yes.

Quote from: sharonarchives
And when you say that PM interpreted our words separated by commas/semi-colons as a single keyword instead of separate keywords, do you mean that "Annie Lebowitz" - separated by a space but surrounded by commas/semicolons would therefore be seen as two keywords: Annie and Lebowitz?

No.  If for instance you had PM set to interpret semi-colons as separators and entered the following:

Annie Lebowitz, photographer

Then PM would interpret the entire thing as a single keyword since it found no semi-colons.

If you had PM set to interpret commas as separators and entered the following:

Annie Lebowitz; photographer

Then PM would interpret the entire thing as a single keyword since it found no commas.
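
The two cases above can be imitated with a small shell sketch. split_on is a hypothetical helper written for this illustration; it mimics the behaviour described here and is not Photo Mechanic's actual code.

```shell
# split_on DELIM TEXT: print one keyword per line, splitting only on DELIM.
# Hypothetical helper that imitates the behaviour described above.
split_on() {
  # Turn each delimiter into a newline, then trim spaces around each keyword.
  printf '%s\n' "$2" | tr "$1" '\n' | sed 's/^ *//; s/ *$//'
}

# Preference set to semicolons, text typed with a comma: stays a single keyword.
split_on ';' 'Annie Lebowitz, photographer'

# Preference set to commas, same text: splits into two keywords.
split_on ',' 'Annie Lebowitz, photographer'
```

The first call prints one line with the comma still in it, which is the kind of item that later shows up unparsed in the exiftool list.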

Quote from: sharonarchives
We have a lot of multi-word keywords, so this will come up a lot.

Why wouldn't it work to do a find and replace to find all commas and replace with semicolons so that everything is separated in the same way, and then run the command line with a perl expression to just parse based on semicolons to pull the unique list?

In general, the commas or semi-colons don't actually exist in the metadata at all, unless one of the two situations above occurred. I would venture a guess that the great majority of your images do not have embedded commas or semi-colons in them and that only a few do. You could identify them and fix them by hand on a case-by-case basis.

If you had either of the above examples, did two finds (one for comma and one for semi-colon), and simply deleted the matches, then you would have this as your keyword: "Annie Lebowitz photographer", which is likely not what you want. Just do the Find (no replace) and then edit the IPTC Info for the images that are selected after the Find.  You should be able to search for "Any of the words" and enter ", ;" as your Find text (don't enter the quotes though) and you should find both cases with one Find.

-Kirk

sharonarchives
Re: Parsing out unique keywords
« Reply #7 on: July 03, 2013, 01:24:46 PM »
Okay, the command worked with that change to the perl regular expression, thank you so much for the help!

And thanks for the explanation about the commas/semicolons within keywords. We don't have any situations like "Annie Lebowitz, photographer" so no cleanup necessary there.

So in that case, we just have some inconsistency in terms of delimiters: sometimes we used commas and sometimes we used semicolons. You're probably right that at some point we changed our preferences.

Just to better understand what you meant by "In general, the commas or semi-colons don't actually exist in the metadata at all" we tried a find on ";" and it brought up all images that had ; as a delimiter.

Is there a way to get them to be uniform (all commas or all semicolons?) that's straightforward? Or does it not really matter that we use them interchangeably so long as we've done it consistently within a single image's keyword field on the IPTC panel? (all commas or all semicolons for each individual image)?

Thanks for all the help,
Sharon

Kirk Baker
Re: Parsing out unique keywords
« Reply #8 on: July 03, 2013, 02:31:36 PM »
Sharon,

Quote from: sharonarchives
Just to better understand what you meant by "In general, the commas or semi-colons don't actually exist in the metadata at all" we tried a find on ";" and it brought up all images that had ; as a delimiter.

When the keywords you enter are delimited by the same character you have told PM that you're using, the metadata will not have any of the delimiters in it at all.  Keywords in the metadata are a list of repeating fields.  They usually don't have commas or semi-colons in them.  When you use the opposite delimiter, then the list of repeating fields will have the delimiters present.  Why is this?  Because PM doesn't know where one keyword starts and ends and makes the entire set of keywords into a single keyword item and the incorrect delimiters will be present in the string of letters, spaces, and any other punctuation.

For instance if you have PM set to use commas as a delimiter and you enter:

"Sharon; Archives; Photography; Birds; Hummingbird"

Then the keywords list in the metadata will be a single item:

Sharon; Archives; Photography; Birds; Hummingbird

When you use exiftool to display the keywords on this item, it will find a single keyword in the keywords list and it will be:

Sharon; Archives; Photography; Birds; Hummingbird

Does that make sense?
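
To list such single-item keywords from the command line, something along these lines might work. It is only a sketch: suspect_keywords is a hypothetical helper, the sample items are made up, and the commented exiftool line assumes that "##" never occurs inside a real keyword.

```shell
# Reads one keyword item per line on stdin and prints only the items that
# still contain a comma or semicolon, i.e. candidates for manual repair.
suspect_keywords() {
  grep '[,;]' | sort -u
}

# Made-up items standing in for real output:
printf '%s\n' 'Birds' 'Sharon; Archives; Photography' 'Hummingbird' | suspect_keywords

# With real files, a pipeline like this could feed the helper. The -sep option
# sets the string exiftool puts between list items; "##" is an arbitrary choice
# assumed not to appear in any keyword.
# exiftool -r -q -T -sep '##' -Keywords /Users/sharon/Desktop \
#   | perl -pe 's/##/\n/g;' | suspect_keywords
```

Because the list items are joined with something other than a comma, any comma or semicolon left in the output belongs to a stored keyword rather than to exiftool's own formatting.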

Quote from: sharonarchives
Is there a way to get them to be uniform (all commas or all semicolons?) that's straightforward? Or does it not really matter that we use them interchangeably so long as we've done it consistently within a single image's keyword field on the IPTC panel? (all commas or all semicolons for each individual image)?

Yes.  I explained how to do so in my last message.  I'll repost it here:

Quote from: kbaker
Just do the Find (no replace) and then edit the IPTC Info for the images that are selected after the Find.  You should be able to search for "Any of the words" and enter ", ;" as your Find text (don't enter the quotes though) and you should find both cases with one Find.

You would then open up the IPTC Info (click on the 'i' button that appears when you put your mouse cursor over a selected image), go to the Keywords field, and repair it manually.  If you use the 'Save & ->' button, you should now get the next selected image loaded, ready for correction.  Repeat until all are corrected.

HTH,

-Kirk

sharonarchives
Re: Parsing out unique keywords
« Reply #9 on: July 03, 2013, 02:42:32 PM »
Thanks, Kirk.

The challenge is we're talking about hundreds of images that would need to be cleaned so that semicolon is replaced by comma, or the opposite. I was hoping there'd be a way to do the change en masse.

Thoughts?
Sharon

Kirk Baker
Re: Parsing out unique keywords
« Reply #10 on: July 03, 2013, 03:03:30 PM »
Sharon,

Quote from: sharonarchives
The challenge is we're talking about hundreds of images that would need to be cleaned so that semicolon is replaced by comma, or the opposite. I was hoping there'd be a way to do the change en masse.

If you know you have no instances where you really want a comma or a semi-colon to exist between two words and have it continue to be treated as a single keyword, then you can use the Find and Replace panel (Edit menu).

First make sure you know what your chosen delimiter is (comma or semi-colon) by looking at the IPTC/XMP section of the Preferences dialog.

Then do a Find and Replace for the character you don't want, replacing with the chosen delimiter.  That should do it in bulk.

-Kirk

P.S.  I created a set of keywords in an image separated by semi-colons, had already chosen comma as my delimiter, did a find for ";" and a replace with "," and it repaired the image just fine.

sharonarchives
Re: Parsing out unique keywords
« Reply #11 on: July 03, 2013, 03:06:42 PM »
Great! Thanks for the instructions and for all the help. Heading out for the holiday weekend, but I'll give this a shot next week and will let you know if we run into any challenges moving ahead.

Thanks again, Happy 4th!
Sharon

sharonarchives
Re: Parsing out unique keywords
« Reply #12 on: July 08, 2013, 03:10:39 PM »
Hi Kirk,

Quick followup question for you, and I write this knowing you work for PM, not exiftool. Any reason you can think of that using exiftool to extract keywords we inputted using PM would skip over keywords for .png files? We're still doing some testing, but it seems that might be the case.

Thoughts?
Sharon

Kirk Baker
Re: Parsing out unique keywords
« Reply #13 on: July 08, 2013, 04:30:30 PM »
Sharon,

Quote from: sharonarchives
Quick followup question for you, and I write this knowing you work for PM, not exiftool. Any reason you can think of that using exiftool to extract keywords we inputted using PM would skip over keywords for .png files? We're still doing some testing, but it seems that might be the case.

I haven't done much with png files with exiftool and I don't know how well they work in practice.
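
One thing that might be worth checking is where the keywords ended up inside a PNG file. This is only a guess, not something confirmed in this thread: the command used earlier asks for the tag named Keywords, and keywords in a PNG could be stored under a different tag or group (XMP Subject, for example). The sketch below filters exiftool's grouped listing for keyword-like tags; keyword_tags is a hypothetical helper, and the sample lines and file path are made up.

```shell
# Reads "exiftool -a -G1 -s FILE" style output on stdin and keeps only the
# lines for keyword-like tags, so the group in brackets shows where they live.
keyword_tags() {
  grep -E '^\[[^]]+\] +(Keywords|Subject|HierarchicalSubject) *:'
}

# Made-up sample output (an assumption; a real PNG may look different):
printf '%s\n' \
  '[PNG]           ImageWidth                      : 800' \
  '[XMP-dc]        Subject                         : Birds, Hummingbird' \
  | keyword_tags

# With a real file (hypothetical path):
# exiftool -a -G1 -s /path/to/image.png | keyword_tags
```

If the keywords turn up under a tag other than Keywords, adding that tag to the extraction command would be the next thing to try.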

-Kirk

sharonarchives
Re: Parsing out unique keywords
« Reply #14 on: July 08, 2013, 04:33:37 PM »
Fair enough, thanks for the response, much appreciated!