At present, the models are improving at an almost incredible rate. The likes of Apple will want to run applications of these models on phones, never mind laptops. As you will be aware, the next release of macOS will provide new AI-based services, running what it can on the local device and offloading more complex tasks to Apple's cloud. I know that you prefer to roll your own stuff, but you could provide hooks, or something along those lines.
As well as fiddling around with llava, I use Google's image annotation API. It's very capable (it does OCR, too), but the keywords it produces don't distinguish between the main subject and other incidental features (sky, car, …).
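To illustrate the problem: Google's API returns each label with a confidence score, so the crude workaround I've tried is splitting labels by score. This is just a sketch over mock data shaped like the API's label annotations; the 0.9 threshold is my own arbitrary cutoff, not anything the API provides, and real scores don't reliably track "main subject" at all.

```python
def split_labels(labels, threshold=0.9):
    """Partition (description, score) pairs into likely-primary and
    incidental keywords, using score as a rough (and unreliable) proxy
    for subject importance."""
    primary = [desc for desc, score in labels if score >= threshold]
    incidental = [desc for desc, score in labels if score < threshold]
    return primary, incidental

# Mock labels in the shape the annotation API returns (description, score).
mock = [("footballer", 0.97), ("stadium", 0.92), ("sky", 0.71), ("car", 0.55)]
main, rest = split_labels(mock)
# main → ["footballer", "stadium"]; rest → ["sky", "car"]
```

In practice this mislabels things constantly, which is exactly why real subject-level metadata would be more useful than raw keywords.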
The thing about this keywording / captioning capability is that it's redundant if the platform hosting your pictures can do all the image recognition itself. What will be needed is the ability to be very specific about the contents of the picture (name of the player, location, etc.).
Anyway, it might be wise to keep to the meat-and-potatoes capabilities of the current program while testing out some more experimental features, to avoid obsolescence.