Ferret AI from Apple - a multimodal model for referring and grounding in images
Apple has finally joined the AI community with its release of Ferret, a multimodal large language model that is particularly good at identifying regions within images.
The Ferret AI model is a sophisticated Multimodal Large Language Model (MLLM) that specialises in both referring and grounding tasks within images. It is designed to understand and interpret spatial references of any shape or granularity within an image.
The model has been developed by Apple in collaboration with researchers from Cornell University and represents a significant advancement in the field of artificial intelligence, particularly in the domain of Multimodal Large Language Models (MLLMs).
The model's primary functionality revolves around two key capabilities: referring and grounding, in relation to images and text. Ferret's referring and grounding capabilities allow it to interpret and connect textual descriptions with specific regions or objects in an image, making it an advanced tool for image-text multimodal tasks. In these tasks it outperforms other models, including GPT-4V, despite that model's access to substantially larger training datasets.
Referring in Ferret
The "referring" ability of Ferret allows users to specify a particular region or an object within an image with a high level of precision. This can be done by pointing to the region, or by drawing a bounding box or even a free-form shape around it. The Ferret MLLM then takes the selected region and translates it into a format that it can understand and process.
The resulting format is a hybrid representation that combines both discrete coordinates (like the exact points or the corners of a bounding box) and continuous features (which might include various characteristics or attributes of the specified region). This hybrid representation is particularly innovative because it allows the model to process and understand regions within an image with a high degree of specificity and flexibility.
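To illustrate the idea (this is a sketch of the concept, not Apple's actual implementation), a hybrid region representation can be expressed in a few lines of Python: the discrete part quantises the region's coordinates into bins that can be treated like tokens alongside text, while the continuous part pools visual features from inside the region.

```python
import numpy as np

def hybrid_region_representation(region_points, feature_map, num_bins=32):
    """Illustrative sketch of combining discrete coordinates with
    continuous visual features for an image region.

    region_points: (N, 2) array of normalised (x, y) points in [0, 1]
    feature_map:   (H, W, C) array of visual features for the image
    """
    pts = np.asarray(region_points, dtype=float)
    h, w, _ = feature_map.shape

    # Discrete part: quantise the region's bounding box into coordinate
    # bins, so the location can be expressed as token-like integers.
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0)
    box_tokens = [int(v * (num_bins - 1)) for v in (x1, y1, x2, y2)]

    # Continuous part: pool the feature vectors at the points inside the
    # region, capturing appearance attributes that coordinates alone miss.
    cols = np.clip((pts[:, 0] * (w - 1)).astype(int), 0, w - 1)
    rows = np.clip((pts[:, 1] * (h - 1)).astype(int), 0, h - 1)
    region_feature = feature_map[rows, cols].mean(axis=0)

    return box_tokens, region_feature
```

The key design point is that the same representation works whether the user supplies a point, a box, or a free-form shape: all of them reduce to a set of points, a quantised bounding box, and a pooled feature vector.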
Grounding in Ferret
"Grounding" refers to Ferret's ability to link descriptive text to specific parts of an image. Essentially, it can take a textual description and find the corresponding area in an image that the text is describing.
It does this by predicting the bounding box coordinates of the region the text refers to. For example, given the description "a red car in the lower left corner of the image," Ferret can locate and mark that exact region.
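To give a feel for what grounded output can look like, here is a small Python sketch that parses phrase-and-box pairs out of a model response. The inline [x1, y1, x2, y2] format is an assumption made for illustration; the actual tokenisation used by Ferret may differ.

```python
import re

def parse_grounded_response(text):
    """Sketch of parsing grounded output, assuming (hypothetically) that
    the model emits a box as [x1, y1, x2, y2] right after each grounded
    phrase, e.g. "a red car [12, 400, 180, 510]"."""
    pattern = re.compile(r"(.*?)\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")
    results = []
    for match in pattern.finditer(text):
        phrase = match.group(1).strip(" .,")          # the grounded phrase
        box = tuple(int(g) for g in match.groups()[1:])  # pixel coordinates
        results.append((phrase, box))
    return results
```

For example, `parse_grounded_response("a red car [12, 400, 180, 510] parked outside")` recovers the phrase "a red car" together with its bounding box, which a front-end could then draw over the image.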
Ferret's impressive performance is due in part to its use of a large-scale dataset called GRIT (Ground-and-Refer Instruction-Tuning), which contains over 1.1 million samples. These samples encompass a wide range of spatial knowledge and include hard negative data* to enhance the robustness of the model.
The combination of the GRIT dataset and the model's architecture, which includes a spatial-aware visual sampler and the hybrid region representation, allows Ferret to perform exceptionally well in tasks that involve detailed image descriptions and minimising object hallucination (incorrectly identifying objects that aren't actually present).
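A rough Python sketch of the spatial-aware sampling idea (an assumption about the mechanism, not Apple's code): sample feature vectors from inside an arbitrarily shaped binary mask and pool them, so that a region of any shape reduces to one fixed-size feature.

```python
import numpy as np

def sample_region_features(feature_map, mask, num_samples=16, seed=0):
    """Illustrative sketch of a spatial-aware visual sampler: draw feature
    vectors from inside a free-form binary mask and pool them into a
    single fixed-size region feature."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(mask)  # pixel positions inside the region
    idx = rng.choice(len(rows), size=min(num_samples, len(rows)),
                     replace=False)
    sampled = feature_map[rows[idx], cols[idx]]  # (num_samples, C)
    return sampled.mean(axis=0)                  # pooled (C,) feature
```

Because the sampler only needs a mask, the same code path handles a point, a box, or a hand-drawn scribble, which is what lets the model accept regions "of any shape or granularity".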
The Ferret-Bench
The Ferret-Bench is a multimodal evaluation benchmark designed to assess the model's capabilities across tasks involving referring/grounding, semantics, knowledge, and reasoning. These evaluations are critical for understanding the model's strengths, its areas for improvement, and its potential in practical applications, such as aiding visually impaired users in understanding their surroundings or enhancing the user experience in interactive AI systems.
The benchmark tests are designed to mimic real-life scenarios and include:
- Referring Description Tasks: Here, Ferret is evaluated on its ability to describe specific areas within an image. These tasks test the model's understanding of spatial relationships and its capacity to provide detailed, contextually relevant descriptions based on the visual information.
- Referring Reasoning Tasks: These tasks delve deeper into Ferret's reasoning capabilities. The model is presented with scenarios where it must infer and reason about the relationships and interactions between different objects or regions in an image. This is crucial for applications where understanding the 'why' and 'how' behind visual elements is as important as recognizing them.
- Grounding in Conversation Tasks: This component of Ferret-Bench tests the model's ability to engage in a meaningful dialogue involving both visual and textual elements. It assesses how well Ferret can maintain coherence and context understanding while grounding its responses in specific visual references.
The following example from Apple's technical paper, Ferret: Refer and Ground Anything Anywhere at Any Granularity, shows a Referring Reasoning task from Ferret-Bench and compares the responses of different models (LLaVA vs. Kosmos-2 vs. Shikra vs. Ferret).
Ferret AI is capable of analysing an image, identifying specific regions or objects based on user queries, and providing detailed information about these regions or objects. It represents a significant advancement in AI's ability to interact with and understand visual data in a detailed and contextually rich manner.
* Hard Negative Data
In machine learning, 'hard negatives' are examples that are challenging for a model to classify correctly. These are the negatives that the model frequently misclassifies. Hard negative mining is a strategy to enhance a model's performance by focusing on these difficult-to-classify instances. Incorporating hard negatives into the training process forces the model to learn more robust and nuanced patterns, improving its overall accuracy and reducing misclassifications.
In tasks like object detection or classification, hard negatives are instances easily confused with positive examples due to their similar characteristics. Training with these examples helps the model better differentiate between classes and make more accurate predictions.
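The mining step itself can be sketched in a few lines of Python; `score_fn` here is a hypothetical stand-in for whatever confidence score the current model assigns to a negative example.

```python
def mine_hard_negatives(negatives, score_fn, keep_fraction=0.25):
    """Minimal sketch of hard negative mining: score each negative
    example with the current model and keep the ones it finds most
    confusing (highest positive-class score), so the next training
    round emphasises exactly the cases the model gets wrong."""
    ranked = sorted(negatives, key=score_fn, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```

In practice this loop is repeated: train, re-score the negative pool with the updated model, re-mine, and train again, so the "hard" set keeps tracking the model's current weaknesses.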