Author Topic: Speech to Text (Read 7880 times)

Claude Diderich · « **on:** May 02, 2025, 04:10:47 AM »

I would love to see the following feature in an upcoming version of PM. I believe the speech to text technology (AI models, even available in open source form) has become sufficiently mature for such a feature to be implemented efficiently and effectively.

The user defines the language model to be used for translating spoken words into written text (Preferences).
The user enters the variable {wav} (or any other naming) in any of the metadata fields.
If a .WAV file exists for a given photo, the variable {wav} is replaced by the text converted from the .WAV file (using a placeholder for unrecognisable words) when the code/variables are evaluated.

Use case

I select "English" as the language model to use to translate speech to text in the "Preferences" ahead of the game.
I take a picture during the basketball game between France and Germany and voice tag it with a note "Peter Pan of team France scores".
When ingesting the photos I set the caption in the "Metadata (IPTC) Template" to "{wav} during the basketball game between France and Germany on May 2, 2025 in Paris, France".
When evaluating the codes/variables {wav} is replaced by the speech to text converted string "Peter Pan of team France scores".
The final caption now reads, without having to type any text, "Peter Pan of team France scores during the basketball game between France and Germany on May 2, 2025 in Paris, France".
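
The substitution step in this use case can be sketched in a few lines of Python. This is only an illustration of the idea, not how PM evaluates variables: the function names `find_voice_note`, `expand_wav`, `fake_transcribe` and `demo` are hypothetical, the voice note is assumed to share the photo's base name with a .WAV extension, and a photo without a voice note is assumed to expand to an empty string. The speech-to-text model itself is replaced by a stand-in that returns a fixed string.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional

PLACEHOLDER = "{wav}"  # variable name taken from the post; any name would do


def find_voice_note(photo: Path) -> Optional[Path]:
    """Return the voice note sharing the photo's base name, or None."""
    for suffix in (".WAV", ".wav"):
        candidate = photo.with_suffix(suffix)
        if candidate.is_file():
            return candidate
    return None


def expand_wav(template: str, photo: Path,
               transcribe: Callable[[Path], str]) -> str:
    """Replace {wav} in the template with the transcribed voice note."""
    if PLACEHOLDER not in template:
        return template  # nothing to expand, so skip the transcription
    note = find_voice_note(photo)
    # Assumption: a photo without a voice note expands to an empty string.
    text = transcribe(note).strip() if note is not None else ""
    return template.replace(PLACEHOLDER, text).strip()


def fake_transcribe(wav: Path) -> str:
    """Stand-in for a real speech-to-text model; returns a canned string."""
    return "Peter Pan of team France scores"


def demo() -> str:
    """Run the use case from the post against a temporary folder."""
    template = ("{wav} during the basketball game between France and "
                "Germany on May 2, 2025 in Paris, France")
    with tempfile.TemporaryDirectory() as folder:
        photo = Path(folder) / "IMG_0001.JPG"
        photo.touch()
        photo.with_suffix(".WAV").touch()  # empty file standing in for audio
        return expand_wav(template, photo, fake_transcribe)


if __name__ == "__main__":
    print(demo())
```

Passing the transcriber in as a function keeps the template logic independent of whichever language model is selected in the Preferences, so a real model could be swapped in for `fake_transcribe` without touching the substitution code.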

Kirk Baker · « **Reply #1 on:** May 02, 2025, 08:28:24 AM »

Claude,

Quote from: Claude Diderich on May 02, 2025, 04:10:47 AM

I would love to see the following feature in an upcoming version of PM. I believe the speech to text technology (AI models, even available in open source form) has become sufficiently mature for such a feature to be implemented efficiently and effectively.
The user defines the language model to be used for translating spoken words into written text (Preferences).
The user enters the variable {wav} (or any other naming) in any of the metadata fields.
If a .WAV file exists for a given photo, the variable {wav} is replaced by the text converted from the .WAV file (using a placeholder for unrecognisable words) when the code/variables are evaluated.

Use case
I select "English" as the language model to use to translate speech to text in the "Preferences" ahead of the game.
I take a picture during the basketball game between France and Germany and voice tag it with a note "Peter Pan of team France scores".
When ingesting the photos I set the caption in the "Metadata (IPTC) Template" to "{wav} during the basketball game between France and Germany on May 2, 2025 in Paris, France".
When evaluating the codes/variables {wav} is replaced by the speech to text converted string "Peter Pan of team France scores".
The final caption now reads, without having to type any text, "Peter Pan of team France scores during the basketball game between France and Germany on May 2, 2025 in Paris, France".

I have been looking into these speech-to-text models over the last year or so. They have been very hit or miss in my testing, especially with recordings from actual users out in the field with noise in the environment. If you come across one that produces accurate transcriptions in noisy environments, and for more than just one language (usually English), then please let me know.

-Kirk

Kevin M. Cox · « **Reply #2 on:** May 03, 2025, 08:21:27 PM »

This would be awesome, but as Kirk said, in some of these environments I can barely hear/understand my own audio notes!

Kirk Baker · « **Reply #3 on:** May 04, 2025, 08:15:38 AM »

Quote from: Kevin M. Cox on May 03, 2025, 08:21:27 PM

This would be awesome, but as Kirk said, in some of these environments I can barely hear/understand my own audio notes!

I used the "Whisper" AI to process some of Kevin's sample annotations and it hardly got any of the words right.

-Kirk

Camera Bits Forums
