There are now various web-based services for transcribing voice, and probably some that are not web-based.
Has anyone considered hooking up to one and transcribing the WAV files associated with images into metadata, e.g. captions? Sure, it will make a mess of some words, but it seems likely to be easier than finding them all, listening, and typing from scratch.
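
To make the idea a bit more concrete, here is a rough Python sketch of what I have in mind. It assumes a voice memo shares its image's base name (e.g. `IMG_0001.JPG` and `IMG_0001.WAV`), which may not match how every camera names them. The `transcribe` argument is a hypothetical placeholder for whichever speech-to-text service or library gets chosen; no particular service's API is implied, and the function names are made up for illustration.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumption: which extensions count as images; adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".tif", ".tiff", ".png"}


def find_voice_memos(folder: Path) -> Dict[Path, Path]:
    """Map each image to the WAV file sharing its base name, if any."""
    # Index WAV files by lower-cased base name so IMG_1.WAV matches img_1.jpg.
    wavs = {
        p.stem.lower(): p
        for p in folder.iterdir()
        if p.suffix.lower() == ".wav"
    }
    pairs = {}
    for image in sorted(folder.iterdir()):
        if image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        wav = wavs.get(image.stem.lower())
        if wav is not None:
            pairs[image] = wav
    return pairs


def draft_captions(
    folder: Path, transcribe: Callable[[Path], str]
) -> Dict[str, str]:
    """Return {image file name: draft caption} for images with a voice memo.

    `transcribe` is a hypothetical callable that takes a WAV path and
    returns text; plug in whatever transcription backend is available.
    """
    return {
        image.name: transcribe(wav).strip()
        for image, wav in find_voice_memos(folder).items()
    }
```

The result is only a draft caption per image. The transcript would still need a quick read-through before being written into the real metadata, but correcting a few words should be quicker than typing everything out.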