
Chris
June 21, 2024
Microsoft drops Florence-2, a unified model to handle a variety of vision tasks
Today, Microsoft’s Azure AI team dropped a new vision foundation model called Florence-2 on Hugging Face. Available under a permissive MIT license, the model can handle a variety of vision and vision-language tasks using a unified, prompt-based representation. It comes in two sizes, 232M and 771M parameters, and already excels at tasks such as captioning, object detection, visual grounding and segmentation, performing on par with or better than many larger vision models.

While the model’s real-world performance has yet to be tested, the work is expected to give enterprises a single, unified approach to handling different types of vision applications. That would spare them the cost of investing in separate task-specific vision models, which cannot go beyond their primary function without extensive fine-tuning.

What makes Florence-2 unique?
Today, large language models (LLMs) sit at the heart of enterprise operations. A single model can provide summaries, write marketing copy and even handle customer service in many cases. The level of adaptability across domains and tasks has been amazing. But this success has also left researchers wondering: Can vision models, which have been largely task-specific, do the same?
At their core, vision tasks are more complex than text-based natural language processing (NLP). They demand comprehensive perceptual ability. Essentially, to achieve a universal representation of diverse vision tasks, a model must understand spatial data across different scales, from broad image-level concepts such as object location to fine-grained pixel details, as well as semantic details ranging from high-level captions to detailed descriptions.
When Microsoft tried to solve this, it found two key roadblocks: a scarcity of comprehensively annotated visual datasets and the absence of a unified pretraining framework with a single network architecture that integrates the ability to understand spatial hierarchy and semantic granularity.
To address this, the company first used specialized models to generate a visual dataset called FLD-5B. It included a total of 5.4 billion annotations for 126 million images, covering details from high-level descriptions to specific regions and objects. Then, using this data, it trained Florence-2, which uses a sequence-to-sequence architecture (a type of neural network designed for tasks involving sequential data) integrating an image encoder and a multi-modality encoder-decoder. This enables the model to handle various vision tasks, without requiring task-specific architectural modifications.
“All annotations in the dataset, FLD-5B, are uniformly standardized into textual outputs, facilitating a unified multi-task learning approach with consistent optimization with the same loss function as the objective,” the researchers wrote in the paper detailing the model. “The outcome is a versatile vision foundation model capable of performing a variety of tasks… all within a single model governed by a uniform set of parameters. Task activation is achieved through textual prompts, reflecting the approach used by large language models.”
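To make the idea of standardizing annotations into text concrete, here is a minimal Python sketch of how a bounding box might be written out as a string and paired with a task prompt, so that detection and captioning share one text-in, text-out format. The article does not describe the encoding, so everything specific here is an assumption for illustration: the `<loc_N>` token format, the 1,000 coordinate bins, the helper names and the prompt strings are hypothetical and should not be read as Microsoft's actual implementation.

```python
import re

NUM_BINS = 1000  # assumed number of coordinate bins; not stated in the article


def box_to_text(label, box, width, height, num_bins=NUM_BINS):
    """Serialize a labeled pixel box (x0, y0, x1, y1) as a text string."""
    x0, y0, x1, y1 = box
    # Normalize each coordinate to [0, 1] by the matching image dimension.
    scaled = [x0 / width, y0 / height, x1 / width, y1 / height]
    # Quantize into integer bins, clamping a coordinate on the far edge.
    bins = [min(num_bins - 1, int(v * num_bins)) for v in scaled]
    return label + "".join(f"<loc_{b}>" for b in bins)


def text_to_box(text, width, height, num_bins=NUM_BINS):
    """Parse the string back into a label and an approximate pixel box."""
    label = text.split("<loc_", 1)[0]
    bins = [int(b) for b in re.findall(r"<loc_(\d+)>", text)]
    if len(bins) != 4:
        raise ValueError("expected exactly four location tokens")
    sizes = [width, height, width, height]
    # Use the bin center, so the round trip is off by at most half a bin.
    box = tuple((b + 0.5) / num_bins * s for b, s in zip(bins, sizes))
    return label, box


def build_example(task_prompt, target_text):
    """Pair a text prompt with a text target, whatever the vision task is."""
    return {"prompt": task_prompt, "target": target_text}


# Two different tasks end up in the same format (prompt strings are hypothetical).
detection = build_example(
    "Locate the objects in the image.",
    box_to_text("cat", (100, 50, 300, 200), width=400, height=400),
)
captioning = build_example(
    "What does the image describe?",
    "A cat sitting on a sofa.",
)
```

Because both examples are plain prompt and target strings, a single sequence-to-sequence model could in principle be trained on them with one loss function, which is the point the researchers make in the quote above. The quantization is lossy: reading a box back recovers each coordinate only to within one bin width.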
