Topic: Speech to Text

Offline Claude Diderich

Speech to Text
« on: May 02, 2025, 04:10:47 AM »
I would love to see the following feature in an upcoming version of PM. I believe speech-to-text technology (AI models, some even available as open source) has become sufficiently mature for such a feature to be implemented efficiently and effectively.
  • The user selects, in Preferences, the language model used to transcribe spoken words into written text.
  • The user enters the variable {wav} (or whatever name is chosen) in any of the metadata fields.
  • If a .WAV file exists for a given photo, then when the codes/variables are evaluated, {wav} is replaced by the text transcribed from the .WAV file (with ??? standing in for unrecognisable words).

Use case
  • Ahead of the game, I select "English" in the "Preferences" as the language model for speech-to-text conversion.
  • I take a picture during the basketball game between France and Germany and voice-tag it with the note "Peter Pan of team France scores".
  • When ingesting the photos, I set the caption in the "Metadata (IPTC) Template" to "{wav} during the basketball game between France and Germany on May 2, 2025 in Paris, France".
  • When the codes/variables are evaluated, {wav} is replaced by the transcribed string "Peter Pan of team France scores".
  • The final caption now reads, without my having to type any text, "Peter Pan of team France scores during the basketball game between France and Germany on May 2, 2025 in Paris, France".
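
To make the request concrete, here is a rough Python sketch of the substitution step described above. It is only an illustration, not how PM evaluates variables. The function names, the rule that the voice note shares the photo's file name with a .WAV extension, and the choice to substitute an empty string when no .WAV exists are all my assumptions. The transcriber is passed in as a plain function, so any speech-to-text engine could be plugged in; marking unrecognisable words with ??? would be that function's job.

```python
from pathlib import Path
from typing import Callable, Optional


def find_voice_note(photo: Path) -> Optional[Path]:
    """Return the voice note stored next to the photo, if there is one.

    Assumption: the note has the same file name as the photo, with a
    .WAV (or .wav) extension.
    """
    for ext in (".WAV", ".wav"):
        candidate = photo.with_suffix(ext)
        if candidate.exists():
            return candidate
    return None


def expand_wav_variable(
    template: str,
    photo: Path,
    transcribe: Callable[[Path], str],
) -> str:
    """Replace {wav} in a metadata template with the transcribed voice note.

    `transcribe` is a hypothetical hook: it takes the path of a .WAV file
    and returns the recognised text.
    """
    if "{wav}" not in template:
        return template  # nothing to do, so skip the transcription entirely

    note = find_voice_note(photo)
    # Assumption: with no voice note, {wav} expands to an empty string.
    text = transcribe(note).strip() if note is not None else ""
    return template.replace("{wav}", text)
```

With a transcriber that returns "Peter Pan of team France scores" for the photo's voice note, the caption template from the use case expands to the final caption quoted above.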
« Last Edit: May 02, 2025, 07:06:07 AM by Claude Diderich »
Claude Diderich, sports photographer, member of AIPS and sportpress.ch
Mülibachstrasse 49, CH-8805 Richterswil, Switzerland, phone: +41 (44) 450 81 66, fax: +41 (44) 450 81 19, mobile: +41 (79) 450 81 66, e-mail: cdiderich@cdsp.photo, internet: http://www.cdsp.photo/

Offline Kirk Baker

  • Senior Software Engineer
  • Camera Bits Staff
Re: Speech to Text
« Reply #1 on: May 02, 2025, 08:28:24 AM »
Claude,

I have been looking into speech-to-text models over the last year or so. They have been very hit or miss in my testing, especially with recordings made by actual users out in the field, with noise in the environment. If you come across one that produces accurate transcriptions in noisy environments, and in more than one language (usually only English works), then please let me know.

-Kirk

Offline Kevin M. Cox

  • 2025.1 (8239) | macOS 15.4.1
Re: Speech to Text
« Reply #2 on: May 03, 2025, 08:21:27 PM »
This would be awesome, but as Kirk said, in some of these environments I can barely hear/understand my own audio notes!
Kevin M. Cox | Photojournalist
https://www.instagram.com/kevin.m.cox/

Offline Kirk Baker

Re: Speech to Text
« Reply #3 on: May 04, 2025, 08:15:38 AM »
I used the "Whisper" AI to process some of Kevin's sample annotations and it hardly got any of the words right.

-Kirk
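
For anyone who wants to put a number on "hit or miss" when comparing models on the same recordings, one common yardstick is the word error rate (WER): the word-level edit distance between a hand-typed reference and the model's output, divided by the number of words in the reference. Below is a minimal Python sketch. The function name and the lower-casing and whitespace splitting are my own simplifying assumptions, and nothing here reflects how anyone in this thread actually ran their tests.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    0.0 means a perfect transcription; values can exceed 1.0 when the
    hypothesis contains many extra words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Calling `word_error_rate(typed_note, model_output)` for each sample and averaging the results gives a single figure per model, which makes it easier to compare candidates on noisy field recordings.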