It would be beneficial to have a node for:
Google Gemini multimodal (Vertex AI)
My use case:
In short: more cost-effective, seemingly quicker, and potentially superior to callin.io's OpenAI "Analyze Image" node.
callin.io already features a node for the OpenAI GPT-4 Vision API, named "OpenAI - Analyze Image". It was recently introduced, possibly in response to the request in "Please add support of the new OpenAI features [done] - #26 by tomtom".
I conducted a few comparisons between OpenAI and Google for the same multimodal use case, involving an image and a prompt. Gemini performed quite well. It appears to be faster (comparing the Google console with the callin.io node, which isn't a perfectly fair comparison) and yielded better creative results (based on my impressions, also not a definitive comparison).
The most significant difference lies in pricing: for an image-plus-prompt combination, Gemini is roughly 4 times cheaper (based on an image of approximately 600x600). Google's pricing is a flat rate per image, whereas OpenAI's pricing scales with image dimensions.
Therefore, I believe a Google-based node could become more popular than the OpenAI-based one. The user interface and parameters for the Google node (prompt + image URL) could mirror those of the OpenAI node.
Any resources to support this?
Vertex AI offers a sandbox within the Google Cloud console.
API documentation is available in the Google Cloud console.
Pricing details can be found at Pricing | Generative AI on Vertex AI | Google Cloud.
I understand that Vertex AI is the designation for the GenAI multimodal API. PaLM exclusively handles text inputs and outputs. The model I was able to test within Vertex AI is named "gemini-1.0-pro-vision-001".
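For reference, here's a minimal sketch of what such a node would need to do under the hood: call Vertex AI's generateContent endpoint with a prompt plus an inline image. The project ID, region, and file name below are placeholders, and it assumes Application Default Credentials are already configured:

```python
import base64

import google.auth
import google.auth.transport.requests
import requests

# Placeholder values -- substitute your own project, region, and image.
PROJECT_ID = "my-gcp-project"
REGION = "us-central1"
MODEL = "gemini-1.0-pro-vision-001"

# Fetch an OAuth2 access token from Application Default Credentials.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

url = (
    f"https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
    f"/locations/{REGION}/publishers/google/models/{MODEL}:generateContent"
)

# Vertex AI accepts inline images as base64-encoded bytes.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "contents": [{
        "role": "user",
        "parts": [
            {"text": "Describe this image."},
            {"inlineData": {"mimeType": "image/jpeg", "data": image_b64}},
        ],
    }]
}

resp = requests.post(
    url, headers={"Authorization": f"Bearer {credentials.token}"}, json=body
)
resp.raise_for_status()
print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])
```

A node could expose exactly the prompt and image inputs and handle the auth and base64 plumbing internally.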
I disclaim any responsibility should Google decide to rename their models and products in a confusing manner at any point.
Are you willing to work on this?
I can create a fork of my workflow and assist in testing the requested node against the currently available OpenAI Analyze Image node.
Hello, has anyone explored this yet?
I haven't received a response following that request. I assume it requires a certain number of upvotes to be considered?
I discovered that callin.io must be aware of it, as there's a landing page optimized for Gemini and Vertex AI keywords, but apparently nothing substantial behind it: Google Vertex AI integrations | Workflow automation with callin.io
This would be an amazing feature. We currently can't use any multimodal capabilities within an AI Agent. It would be great to be able to pass an audio file directly to Gemini, for example, without needing to run it through Whisper first.
I’d second this, and would have thought the existing Gemini node could be tweaked to allow non-image binaries to be passed through, since the same functionality (submitting audio/video) can already be achieved with the HTTP node, albeit less elegantly!
There’s a wealth of potential in Gemini's multimodal abilities: it can analyse music, for example, providing information on structure and influences, as well as simply transcribing words, but I currently have to string Code and HTTP nodes together to achieve this.
To add, in case it helps anyone willing and able to develop this: the multimodal capabilities can be achieved via the HTTP node, as demonstrated below (using a form-submission prompt and the Gemini API instead of Vertex, but the core principles are the same). Integrating these capabilities directly into the Gemini/Vertex nodes would be a significant advancement. I'm not aware of other models that can process various file types as effectively as Gemini 2+.
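As an illustration of the underlying request, here's a minimal sketch against the Gemini API's generateContent endpoint with an inline audio file, which is essentially what the HTTP node sends. The API key, model name, and file name are placeholders; note that inline base64 only suits small files, with larger ones going through the separate Files API:

```python
import base64

import requests

# Placeholder values -- use your own API key and a current model name.
API_KEY = "YOUR_GEMINI_API_KEY"
MODEL = "gemini-2.0-flash"

url = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL}:generateContent?key={API_KEY}"
)

# Inline base64 suits small files; larger uploads should go through
# the separate Files API instead.
with open("song.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "contents": [{
        "parts": [
            {"text": "Describe the structure and influences of this track."},
            {"inline_data": {"mime_type": "audio/mpeg", "data": audio_b64}},
        ]
    }]
}

resp = requests.post(url, json=body)
resp.raise_for_status()
print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])
```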