The future of AI models is multimodal, no doubt. However, this does not necessarily require training large new models. Instead, existing solutions can be combined.
Microsoft researchers have added an important feature to OpenAI's chatbot ChatGPT, which was released in November 2022: image processing. Until now, the language model could only handle text, but Visual ChatGPT can send and receive images as well as text.
According to the researchers, a multimodal conversational model could be trained for this purpose, but this would require large amounts of data and computing resources. This approach is also inflexible: the model could not be extended to other modalities such as audio or video without retraining.
Linking ChatGPT to 22 Image Models
Instead of training a new model, the researchers linked ChatGPT to 22 different visual foundation models (VFMs), including Stable Diffusion. These models perform various tasks, such as answering questions about images, creating and editing images, or extracting information such as depth data.
The team has developed a prompt manager that acts as a bridge between ChatGPT and the VFMs by performing the following functions:
- Clearly communicates the capabilities of each VFM to ChatGPT and defines the input and output formats.
- Converts various kinds of visual information, such as PNG images or images with depth information, into a language format that ChatGPT can understand.
- Manages the histories, priorities, and conflicts of the various VFMs.
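The three functions above can be illustrated with a minimal sketch. This is not Microsoft's implementation: the `VisualTool` and `PromptManager` classes, their fields, and the `fake_depth_estimator` stand-in are all hypothetical names chosen for illustration, and no real model is run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VisualTool:
    """Hypothetical wrapper around one visual foundation model (VFM)."""
    name: str
    description: str      # what the model can do, in plain language
    input_format: str     # what the chat model must pass in
    output_format: str    # what the tool hands back
    run: Callable[[str], str]


class PromptManager:
    """Illustrative bridge between a chat model and a set of VFMs."""

    def __init__(self) -> None:
        self.tools: Dict[str, VisualTool] = {}
        self.history: List[str] = []

    def register(self, tool: VisualTool) -> None:
        self.tools[tool.name] = tool

    def system_prompt(self) -> str:
        # Function 1: describe each tool and its input/output format in text.
        lines = ["You can use the following tools:"]
        for tool in self.tools.values():
            lines.append(
                f"- {tool.name}: {tool.description} "
                f"Input: {tool.input_format}. Output: {tool.output_format}."
            )
        return "\n".join(lines)

    def call(self, tool_name: str, tool_input: str) -> str:
        # Function 2: visual results come back as text (here, a file name).
        result = self.tools[tool_name].run(tool_input)
        # Function 3: record the call so later turns can refer back to it.
        self.history.append(f"{tool_name}({tool_input}) -> {result}")
        return result


def fake_depth_estimator(image_path: str) -> str:
    # Stand-in for a real model: derives an output file name, runs no inference.
    return image_path.replace(".png", "_depth.png")


manager = PromptManager()
manager.register(VisualTool(
    name="depth_estimation",
    description="Predicts a depth map for an image.",
    input_format="path to a PNG image",
    output_format="path to the depth-map PNG",
    run=fake_depth_estimator,
))
print(manager.system_prompt())
print(manager.call("depth_estimation", "image/cat.png"))
```

The key design point in this sketch is that the chat model never sees pixels: it only sees tool descriptions and file names, both of which are plain text.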
Visual ChatGPT can generate images, name them appropriately, save them, and keep them ready for further input. It can also process user images as input.
If the chat model is unsure which VFM is best suited to the task, Visual ChatGPT asks for clarification. It can also chain multiple VFMs in this way.
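Chaining through named image files could look roughly like the sketch below. The naming scheme, the `make_step` and `run_chain` helpers, and the operation names `depth` and `generate` are assumptions made for illustration; the article does not specify how Visual ChatGPT names its files, and the stand-in steps do not run any model.

```python
import uuid
from typing import Callable, List

Step = Callable[[str], str]


def make_step(operation: str) -> Step:
    """Build a stand-in tool that returns a new, descriptively named file path."""
    def step(input_path: str) -> str:
        # A real tool would run a model on input_path and write the result to disk.
        return f"image/{uuid.uuid4().hex[:8]}_{operation}.png"
    return step


def run_chain(steps: List[Step], first_input: str) -> List[str]:
    """Feed each step's output file name into the next step."""
    paths = [first_input]
    for step in steps:
        paths.append(step(paths[-1]))
    return paths


# Example: estimate depth for an uploaded image, then generate a new image from it.
chain = [make_step("depth"), make_step("generate")]
for path in run_chain(chain, "image/user_upload.png"):
    print(path)
```

Because every intermediate result is just a file name, the chat model can refer to any earlier image in a later turn by quoting that name.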
While the examples Microsoft presents with Visual ChatGPT are promising, there are still some limitations. Visual ChatGPT depends entirely on ChatGPT and the connected image models.
The upper limit on the number of tokens that ChatGPT can handle is a constraining factor as well. Also, describing the VFMs in language form requires a lot of preparatory work.
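To see why the token limit matters, consider that tool descriptions and the conversation history all have to fit into one prompt. One common workaround is to drop the oldest turns. The sketch below shows that idea only; it is not taken from Visual ChatGPT, the `truncate_history` helper is hypothetical, and counting whitespace-separated words is a crude stand-in for real tokenization.

```python
from typing import List


def truncate_history(turns: List[str], max_tokens: int) -> List[str]:
    """Keep the most recent turns that fit into the budget.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    kept: List[str] = []
    used = 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = len(turn.split())
        if used + cost > max_tokens:
            break                         # this turn and all older ones are dropped
        kept.append(turn)
        used += cost
    kept.reverse()                        # restore chronological order
    return kept


history = [
    "User: here is image/cat.png",
    "Tool: depth map saved as image/cat_depth.png",
    "User: now make the background blue",
]
print(truncate_history(history, max_tokens=14))
```

The trade-off is that dropped turns may contain the file names of earlier images, so a long session can lose track of results that were produced early on.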
Previous developments have laid important groundwork.
Microsoft integrates several existing methods into Visual ChatGPT that give more control over image models through additional models or prompt engineering. In recent months there have been many developments in this area, such as InstructPix2Pix, ControlNet, and GLIGEN.