
Microsoft Fields

Class Fields 41 members
Models published by Microsoft.
AzureAIContentSafety static readonly FoundryModel

Azure AI Content Safety

Introduction

Azure AI Content Safety is a safety system for monitoring content generated by both foundation models and humans. Detect and block potential risks, threats, and quality problems. You can build an advanced safety system for foundation models to detect and mitigate harmful content and risks in user prompts and AI-generated outputs. Use Prompt Shields to detect and block prompt injection attacks, groundedness detection to pinpoint ungrounded or hallucinated materials, and protected material detection to identify copyrighted or owned content.

Core Features

  • Block harmful input and output

    • Description: Detect and block violence, hate, sexual, and self-harm content across text, images, and multimodal content. Configure severity thresholds for your specific use case and adhere to your responsible AI policies.

    • Key Features: Violence, hate, sexual, and self-harm content detection. Custom blocklist.

  • Policy customization with custom categories

    • Description: Create unique content filters tailored to your requirements using custom categories. Quickly train a new custom category by providing examples of content you need to block.

    • Key Features: Custom categories

  • Identify the security risks

    • Description: Safeguard your AI applications against prompt injection attacks and jailbreak attempts. Identify and mitigate both direct and indirect threats with prompt shields.

    • Key Features: Detection of direct jailbreak attacks and indirect prompt injection from documents.

  • Detect and correct Gen AI hallucinations

    • Description: Identify and correct generative AI hallucinations and ensure outputs are reliable, accurate, and grounded in data with groundedness detection.

    • Key Features: Groundedness detection, reasoning, and correction.

  • Identify protected material

    • Description: Pinpoint copyrighted content and provide sources for preexisting text and code with protected material detection.

    • Key Features: Protected material for code, protected material for text

Use Cases

  • Generative AI services screen user-submitted prompts and generated outputs to ensure safe and appropriate content.

  • Online marketplaces monitor and filter product listings and other user-generated content to prevent harmful or inappropriate material.

  • Gaming platforms manage and moderate user-created game content and in-game communication to maintain a safe environment.

  • Social media platforms review and regulate user-uploaded images and posts to enforce community standards and prevent harmful content.

  • Enterprise media companies implement centralized content moderation systems to ensure the safety and appropriateness of their published materials.

  • K-12 educational technology providers filter out potentially harmful or inappropriate content to create a safe learning environment for students and educators.

Benefits

  • No ML experience required: Incorporate content safety features into your projects with no machine learning experience required.

  • Effortlessly customize your RAI policies: Customize your content safety classifiers with a one-line description and a few samples using Custom Categories.

  • State-of-the-art models: Ready-to-use APIs, SOTA models, and flexible deployment options reduce the need for ongoing manual training or extensive customization. Microsoft has a science team and policy experts working on the frontier of Gen AI to constantly improve the safety and security models so that customers can develop and deploy generative AI safely and responsibly.

  • Global Reach: Supports more than 100 languages, enabling businesses to communicate effectively with customers, partners, and employees worldwide.

  • Scalable and Reliable: Built on Azure's cloud infrastructure, the Azure AI Content Safety service scales automatically to meet demand, from small business applications to global enterprise workloads.

  • Security and Compliance: Azure AI Content Safety runs on Azure's secure cloud infrastructure, ensuring data privacy and compliance with global standards. User data is not stored after the translation process.

  • Flexible deployment: Azure AI Content Safety can be deployed in the cloud, on premises, and on devices.

Technical Details

Pricing

Explore pricing options here: Azure AI Content Safety - Pricing | Microsoft Azure.

public static readonly FoundryModel AzureAIContentSafety
AzureAIContentUnderstanding static readonly FoundryModel

Azure AI Content Understanding

Introduction

Azure AI Content Understanding empowers you to transform unstructured multimodal data—such as text, images, audio, and video—into structured, actionable insights. By streamlining content processing with advanced AI techniques like schema extraction and grounding, it delivers accurate structured data for downstream applications. Offering prebuilt templates for common use cases and customizable models, it helps you unify diverse data types into a single, efficient pipeline, optimizing workflows and accelerating time to value.

Core Features

  • Multimodal data ingestion: Ingest a range of modalities such as documents, images, audio, or video. Use a variety of AI models to convert the input data into a structured format that can be easily processed and analyzed by downstream services or applications.

  • Customizable output schemas: Customize the schemas of extracted results to meet your specific needs. Tailor the format and structure of summaries, insights, or features to include only the most relevant details, such as key points or timestamps, from video or audio files.

  • Confidence scores: Leverage confidence scores to minimize human intervention and continuously improve accuracy through user feedback.

  • Output ready for downstream applications: Automate business processes by building enterprise AI apps or agentic workflows. Use outputs that downstream applications can consume for reasoning with retrieval-augmented generation (RAG).

  • Grounding: Ensure the information extracted, inferred, or abstracted is represented in the underlying content.

  • Automatic labeling: Save time and effort on manual annotation and create models quicker by using large language models (LLMs) to extract fields from various document types.

Use Cases

  • Post-call analytics for call centers: Generate insights from call recordings, track key performance indicators (KPIs), and answer customer questions more accurately and efficiently.

  • Tax process automation: Streamline the tax return process by extracting data from tax forms to create a consolidated view of information across various documents.

  • Media asset management: Extract features from images and videos to provide richer tools for targeted content and enhance media asset management solutions.

  • Chart understanding: Enhance chart understanding by automating the analysis and interpretation of various types of charts and diagrams using Content Understanding.

Benefits

  • Streamline workflows: Azure AI Content Understanding standardizes the extraction of content, structure, and insights from various content types into a unified process.

  • Simplify field extraction: Field extraction in Content Understanding makes it easier to generate structured output from unstructured content. Define a schema to extract, classify, or generate field values with no complex prompt engineering.

  • Enhance accuracy: Content Understanding employs multiple AI models to analyze and cross-validate information simultaneously, resulting in more accurate and reliable results.

  • Confidence scores & grounding: Content Understanding ensures the accuracy of extracted values while minimizing the cost of human review.

Technical Details

Pricing

View up-to-date pay-as-you-go pricing details here: Azure AI Content Understanding pricing.

public static readonly FoundryModel AzureAIContentUnderstanding
AzureAIDocumentIntelligence static readonly FoundryModel

Azure AI Document Intelligence

Document Intelligence is a cloud-based service that enables you to build intelligent document processing solutions. Massive amounts of data, spanning a wide variety of data types, are stored in forms and documents. Document Intelligence enables you to effectively manage the velocity at which data is collected and processed and is key to improved operations, informed data-driven decisions, and enlightened innovation.

Core Features

  • General extraction models

    • Description: General extraction models enable text extraction from forms and documents and return structured business-ready content ready for your organization's action, use, or development.

    • Key Features

      • Read model allows you to extract written or printed text lines, words, locations, and detected languages.

      • Layout model, on top of text extraction, extracts structural information like tables, selection marks, paragraphs, titles, headings, and subheadings. Layout model can also output the extraction results in a Markdown format, enabling you to define your semantic chunking strategy based on provided building blocks, allowing for easier RAG (Retrieval Augmented Generation).

  • Prebuilt models

    • Description: Prebuilt models enable you to add intelligent document processing to your apps and flows without having to train and build your own models. Prebuilt models extract a pre-defined set of fields depending on the document type.

    • Key Features

      • Financial Services and Legal Documents: Credit Cards, Bank Statement, Pay Slip, Check, Invoices, Receipts, Contracts.

      • US Tax Documents: Unified Tax, W-2, 1099 Combo, 1040 (multiple variations), 1098 (multiple variations), 1099 (multiple variations).

      • US Mortgage Documents: 1003, 1004, 1005, 1008, Closing Disclosure.

      • Personal Identification Documents: Identity Documents, Health Insurance Cards, Marriage Certificates.

  • Custom models

    • Description: Custom models are trained using your labeled datasets to extract distinct data from forms and documents, specific to your use cases. Standalone custom models can be combined to create composed models.

    • Key Features

      • Document field extraction models

        • Custom generative: Build a custom extraction model using generative AI for documents with unstructured format and varying templates.

        • Custom neural: Extract data from mixed-type documents.

        • Custom template: Extract data from static layouts.

        • Custom composed: Extract data using a collection of models. Explicitly choose the classifier and enable confidence-based routing based on the threshold you set.

      • Custom classification models

        • Custom classifier: Identify designated document types (classes) before invoking an extraction model.

  • Add-on capabilities

    • Description: Use the add-on features to extend the results to include more features extracted from your documents. Some add-on features incur an extra cost. These optional features can be enabled and disabled depending on the scenario of the document extraction.

    • Key Features

      • High resolution extraction

      • Formula extraction

      • Font extraction

      • Barcode extraction

      • Language detection

      • Searchable PDF output

Use Cases

  • Accounts payable: A company can increase the efficiency of its accounts payable clerks by using the prebuilt invoice model and custom forms to speed up invoice data entry with a human in the loop. The prebuilt invoice model can extract key fields, such as Invoice Total and Shipping Address.

  • Insurance form processing: A customer can train a model by using custom forms to extract key-value pairs from insurance forms and then feed the data to their business flow to improve the accuracy and efficiency of their process. For their unique forms, customers can build their own model that extracts key values by using custom forms. These extracted values then become actionable data for various workflows within their business.

  • Bank form processing: A bank can use the prebuilt ID model and custom forms to speed up the data entry for "know your customer" documentation, or to speed up data entry for a mortgage packet. If a bank requires their customers to submit personal identification as part of a process, the prebuilt ID model can extract key values, such as Name and Document Number, speeding up the overall time for data entry.

  • Robotic process automation (RPA): Using the custom extraction model, customers can extract specific data needed from distinct types of documents. The extracted key-value pairs can then be entered into various systems, such as databases or CRM systems, through RPA, replacing manual data entry. Customers can also use a custom classification model to categorize documents based on their content and file them in the proper location. As such, an organized set of data extracted from the custom model can be an essential first step to document RPA scenarios for businesses that manage large volumes of documents regularly.

Benefits

  • No experience required: Incorporate Document Intelligence features into your projects with no machine learning experience required.

  • Effortlessly customize your models: Training your own custom extraction and classification model can be done with as little as one document labeled, making it easy to train your own models.

  • State-of-the-art models: Ready-to-use APIs, constantly enhanced models, and flexible deployment options reduce the need for ongoing manual training or extensive customization.

Technical Details

Pricing

View up-to-date pricing information for the pay-as-you-go pricing model here: Azure AI Document Intelligence pricing.

public static readonly FoundryModel AzureAIDocumentIntelligence
AzureAILanguage static readonly FoundryModel
Azure AI Language service.
public static readonly FoundryModel AzureAILanguage
AzureAITranslator static readonly FoundryModel
Azure AI Translator service.
public static readonly FoundryModel AzureAITranslator

AzureAIVision static readonly FoundryModel

Azure AI Vision

Introduction

The Azure AI Vision service gives you access to advanced algorithms that process images and videos and return insights based on the visual features and content you are interested in. Azure AI Vision can power a diverse set of scenarios, including digital asset management, video content search & summary, identity verification, generating accessible alt-text for images, and many more. The key product categories for Azure AI Vision include Video Analysis, Image Analysis, Face, and Optical Character Recognition.

Core Features

  • Video analysis

    • Description: Video Analysis includes video-related features like Spatial Analysis and Video Retrieval. Spatial Analysis analyzes the presence and movement of people on a video feed and produces events that other systems can respond to. Video Retrieval lets you create an index of videos that you can search using natural language.

    • Key Features: Video retrieval, spatial analysis, person counting, person in a zone, person crossing a line, person distance

  • Face

    • Description: The Face service provides AI algorithms that detect, recognize, and analyze human faces in images. Facial recognition software is important in many different scenarios, such as identification, touchless access control, and face blurring for privacy.

    • Key Features: Face detection and analysis, face liveness, face identification, face verification

  • Image analysis

    • Description: The Image Analysis service extracts many visual features from images, such as objects, faces, adult content, and auto-generated text descriptions.

    • Key Features: Image tagging, image classification, object detection, image captioning, dense captioning, face detection, optical character recognition, image embeddings, and image search

  • Optical character recognition

    • Description: The Optical Character Recognition (OCR) service extracts text from images. You can use the Read API to extract printed and handwritten text from photos and documents. It uses deep-learning-based models and works with text on various surfaces and backgrounds. These include business documents, invoices, receipts, posters, business cards, letters, and whiteboards. The OCR APIs support extracting printed text in several languages.

    • Key Features: OCR

Use Cases

  • Boost content discovery with image analysis

  • Verify identities with the Face service

  • Search content in videos

Benefits

  • No experience required: Incorporate vision features into your projects with no machine learning experience required.

  • Effortlessly customize your models: Customizing your image classification and object detection models can be done with as little as one image per tag, making it easy to train your own models.

  • State-of-the-art models: Ready-to-use APIs, constantly enhanced models, and flexible deployment options reduce the need for ongoing manual training or extensive customization.

Technical Details

Pricing

View up-to-date pricing information for the pay-as-you-go pricing model here: Azure AI Vision pricing.

public static readonly FoundryModel AzureAIVision
AzureContentUnderstandingLayout static readonly FoundryModel

Azure Content Understanding - Layout

Content Understanding Layout offers rich, structure‑aware extraction that captures text, formatting, tables, figures, and geometric layout details. It’s designed for complex document understanding workflows that require positional accuracy and deeper structural insights.

Azure Content Understanding

Azure Content Understanding uses generative AI to process/ingest content of many types (documents, images, videos, and audio) into a user-defined output format. It offers a streamlined process to reason over large amounts of unstructured data, accelerating time-to-value by generating an output that can be integrated into automation and analytical workflows.

Key capabilities

About this model

The Layout model offers rich, structure‑aware analysis for documents that require deeper understanding of formatting, hierarchy, and spatial relationships. It combines textual extraction with geometric layout detection to support advanced automation and content reasoning.

Key model capabilities

  • Extracts detailed content and layout elements such as words, paragraphs, tables, figures, and sections

  • Identifies document structure, formatting patterns, and hierarchical organization

  • Extracts hyperlinks embedded in documents

  • Captures annotations such as highlights, underlines, and strikethroughs in digital PDFs

  • Provides precise positional information for all extracted elements

  • Detects all figure types (charts, diagrams, pictures, icons, and other images) with bounding box details (PDF only)

  • Suitable for advanced workflows such as document automation, RAG indexing, semantic search, and any process demanding fine‑grained layout understanding

Pricing

View up-to-date pay-as-you-go pricing details here: Azure AI Content Understanding pricing.

Technical details

More information

Learn more in the full Azure AI Content Understanding documentation.

public static readonly FoundryModel AzureContentUnderstandingLayout
AzureContentUnderstandingRead static readonly FoundryModel

Azure Content Understanding - Read

Content Understanding Read provides fast, reliable extraction of text and basic content elements from documents, enabling simple ingestion workflows without layout interpretation. It’s ideal for scenarios where clean text output is needed for downstream automation, classification, or search.

Azure Content Understanding

Azure Content Understanding uses generative AI to process/ingest content of many types (documents, images, videos, and audio) into a user-defined output format. It offers a streamlined process to reason over large amounts of unstructured data, accelerating time-to-value by generating an output that can be integrated into automation and analytical workflows.

Key capabilities

About this model

The Read model provides foundational text extraction capabilities for simple, fast, and reliable ingestion of document content. It focuses on capturing textual elements without performing layout or structural analysis, making it ideal for lightweight processing and downstream text-based workflows.

Key model capabilities

  • Extracts fundamental content elements such as words, lines, paragraphs, formulas, and barcodes

  • Provides basic OCR functionality for a wide range of document types

  • Returns text results without layout interpretation

  • Best suited for scenarios requiring quick ingestion, metadata extraction, transcription, or feeding clean text into analytic or search pipelines

Pricing

View up-to-date pay-as-you-go pricing details here: Azure AI Content Understanding pricing.

Technical details

More information

Learn more in the full Azure AI Content Understanding documentation.

public static readonly FoundryModel AzureContentUnderstandingRead
AzureLanguageConversationalPiiRedaction static readonly FoundryModel
PII Redaction for Conversation automatically detects and masks sensitive information such as names, addresses, phone numbers, credit card details, and other personally identifiable information (PII) in meeting transcripts.
public static readonly FoundryModel AzureLanguageConversationalPiiRedaction
AzureLanguageDocumentPiiRedaction static readonly FoundryModel
PII Redaction for Documents automatically detects and masks sensitive information such as names, addresses, phone numbers, credit card details, and other personally identifiable information (PII) in native documents including PDF, Word, and text files.
public static readonly FoundryModel AzureLanguageDocumentPiiRedaction
AzureLanguageLanguageDetection static readonly FoundryModel
Language detection quickly and accurately identifies the language of any text, supporting over 100 languages and dialects, including the ISO 15924 standard for a select number of languages.
public static readonly FoundryModel AzureLanguageLanguageDetection
AzureLanguageTextAnalyticsForHealth static readonly FoundryModel
Text Analytics for Health extracts and labels relevant medical information from unstructured clinical text, including doctors' notes, discharge summaries, and electronic health records, using named entity recognition, relation extraction, entity linking, a
public static readonly FoundryModel AzureLanguageTextAnalyticsForHealth
AzureLanguageTextPiiRedaction static readonly FoundryModel
PII Redaction for Text automatically detects and masks sensitive information such as names, addresses, phone numbers, credit card details, and other personally identifiable information (PII) in unstructured text.
public static readonly FoundryModel AzureLanguageTextPiiRedaction
AzureSpeechSpeechToText static readonly FoundryModel
Transcribes streaming or recorded audio into readable text across 140+ languages and dialects. Accuracy can be further optimized with custom models for your specialized use cases.
public static readonly FoundryModel AzureSpeechSpeechToText
AzureSpeechSpeechTranslation static readonly FoundryModel
Translates streaming or recorded audio into text or audio across 140+ languages and dialects. Accuracy can be further optimized with custom models for your specialized use cases.
public static readonly FoundryModel AzureSpeechSpeechTranslation
AzureSpeechTextToSpeech static readonly FoundryModel

Text-to-speech enables your applications, tools, or devices to convert text into natural synthesized speech. It leverages advanced out-of-the-box [prebuilt neural voices](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?t

public static readonly FoundryModel AzureSpeechTextToSpeech
AzureSpeechTextToSpeechAvatar static readonly FoundryModel
Text to speech avatar converts text into a digital video of a human (either a standard avatar or a custom text to speech avatar) speaking with a natural-sounding voice. The text to speech avatar video can be synthesized asynchronously or in real time. Deve
public static readonly FoundryModel AzureSpeechTextToSpeechAvatar
AzureSpeechVoiceLive static readonly FoundryModel
Voice Live API is a single unified API that enables low-latency, high-quality speech to speech interactions for voice agents.
public static readonly FoundryModel AzureSpeechVoiceLive
AzureTranslatorDocumentTranslation static readonly FoundryModel
Document translation is a cloud-based, multilingual service that uses AI to translate documents from one language to another while preserving the document layout.
public static readonly FoundryModel AzureTranslatorDocumentTranslation
AzureTranslatorTextTranslation static readonly FoundryModel
Text translation is a cloud-based, multilingual service that uses neural machine translation models (NMT) and/or large language models (LLM) to translate text from one language to another, supporting 135 languages.
public static readonly FoundryModel AzureTranslatorTextTranslation
LanguageDetection static readonly FoundryModel
Azure Language Detection service.
public static readonly FoundryModel LanguageDetection
MaiDSR1 static readonly FoundryModel
MAI-DS-R1 is a DeepSeek-R1 reasoning model that has been post-trained by the Microsoft AI team to fill in information gaps in the previous version of the model and improve its harm protections while maintaining R1 reasoning capabilities.
public static readonly FoundryModel MaiDSR1
MAI-Transcribe-1 is an ASR model built to deliver high quality batch transcription whenever the user speaks. It is designed to achieve high accuracy across 25 languages and to adapt seamlessly to diverse accents, dialects, and regional speech patterns.
public static readonly FoundryModel MaiTranscribe1
MAI-Voice-1 is a text-to-speech (TTS) model that generates high-quality single-speaker speech and, soon, multi-speaker speech for public preview. It produces audio that strictly follows the input transcript and supports per-turn emotion control as well as
public static readonly FoundryModel MaiVoice1
ModelRouter static readonly FoundryModel
Model router is a deployable AI model that is trained to select the most suitable large language model (LLM) for a given prompt.
public static readonly FoundryModel ModelRouter
Phi35MiniInstruct static readonly FoundryModel
Refresh of Phi-3-mini model.
public static readonly FoundryModel Phi35MiniInstruct
Phi35MoEInstruct static readonly FoundryModel
A new mixture-of-experts model.
public static readonly FoundryModel Phi35MoEInstruct
Phi35VisionInstruct static readonly FoundryModel
Refresh of Phi-3-vision model.
public static readonly FoundryModel Phi35VisionInstruct
Phi3Medium128kInstruct static readonly FoundryModel
Same Phi-3-medium model, but with a larger context size for RAG or few shot prompting.
public static readonly FoundryModel Phi3Medium128kInstruct
Phi3Medium4kInstruct static readonly FoundryModel
A 14B-parameter model that delivers better quality than Phi-3-mini, with a focus on high-quality, reasoning-dense data.
public static readonly FoundryModel Phi3Medium4kInstruct
Phi3Mini128kInstruct static readonly FoundryModel
Same Phi-3-mini model, but with a larger context size for RAG or few shot prompting.
public static readonly FoundryModel Phi3Mini128kInstruct
Phi3Mini4kInstruct static readonly FoundryModel
Tiniest member of the Phi-3 family. Optimized for both quality and low latency.
public static readonly FoundryModel Phi3Mini4kInstruct
Phi3Small128kInstruct static readonly FoundryModel
Same Phi-3-small model, but with a larger context size for RAG or few shot prompting.
public static readonly FoundryModel Phi3Small128kInstruct
Phi3Small8kInstruct static readonly FoundryModel
A 7B-parameter model that delivers better quality than Phi-3-mini, with a focus on high-quality, reasoning-dense data.
public static readonly FoundryModel Phi3Small8kInstruct
Phi3Vision128kInstruct static readonly FoundryModel

Model Summary

Phi-3 Vision is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data for both text and vision. The model belongs to the Phi-3 model family, and the multimodal version supports a context length of 128K tokens. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.

Resources and Technical Documentation:

Training

Model

  • Architecture: Phi-3-Vision-128K-Instruct has 4.2B parameters and contains an image encoder, a connector, a projector, and the Phi-3 Mini language model.

  • Inputs: Text and Image. It’s best suited for prompts using the chat format.

  • Context length: 128K tokens

  • GPUs: 512 H100-80G

  • Training time: 1.5 days

  • Training data: 500B vision and text tokens

  • Outputs: Generated text in response to the input

  • Dates: Our models were trained between February and April 2024

  • Status: This is a static model trained on an offline text dataset with cutoff date Mar 15, 2024. Future versions of the tuned models may be released as we improve models.

  • Release Type: Open weight release

  • Release date: The model weights were released on May 21, 2024.

Datasets

Our training data includes a wide variety of sources, and is a combination of

  1. publicly available documents filtered rigorously for quality, selected high-quality educational data and code;

  2. selected high-quality interleaved image-text data;

  3. newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (science, daily activities, theory of mind, etc.), newly created image data, e.g., chart/table/diagram/slides;

  4. high-quality chat-format supervised data covering various topics to reflect human preferences on different aspects such as instruction-following, truthfulness, honesty, and helpfulness.

The data collection process involved sourcing information from publicly available documents, with a meticulous approach to filtering out undesirable documents and images. To safeguard privacy, we carefully filtered various image and text data sources to remove or scrub any potentially personal data from the training data.

More details can be found in the Phi-3 Technical Report.

Benchmarks

To understand the capabilities, we compare Phi-3 Vision-128K-Instruct with a set of models over a variety of zero-shot benchmarks using our internal benchmark platform.

 

| Benchmark | Phi-3 Vision-128K-Instruct | LlaVA-1.6 Vicuna-7B | QWEN-VL Chat | Llama3-Llava-Next-8B | Claude-3 Haiku | Gemini 1.0 Pro V | GPT-4V-Turbo |
|---|---|---|---|---|---|---|---|
| MMMU | 40.4 | 34.2 | 39.0 | 36.4 | 40.7 | 42.0 | 55.5 |
| MMBench | 80.5 | 76.3 | 75.8 | 79.4 | 62.4 | 80.0 | 86.1 |
| ScienceQA | 90.8 | 70.6 | 67.2 | 73.7 | 72.0 | 79.7 | 75.7 |
| MathVista | 44.5 | 31.5 | 29.4 | 34.8 | 33.2 | 35.0 | 47.5 |
| InterGPS | 38.1 | 20.5 | 22.3 | 24.6 | 32.1 | 28.6 | 41.0 |
| AI2D | 76.7 | 63.1 | 59.8 | 66.9 | 60.3 | 62.8 | 74.7 |
| ChartQA | 81.4 | 55.0 | 50.9 | 65.8 | 59.3 | 58.0 | 62.3 |
| TextVQA | 70.9 | 64.6 | 59.4 | 55.7 | 62.7 | 64.7 | 68.1 |
| POPE | 85.8 | 87.2 | 82.6 | 87.0 | 74.4 | 84.2 | 83.7 |
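As a reading aid, the short sketch below finds the top-scoring model on each benchmark. The scores are transcribed from the benchmark figures above; the names `MODELS`, `SCORES`, and `leaders` are illustrative and not part of any published tooling.

```python
# Scores transcribed from the benchmark figures above, in the listed column order.
MODELS = [
    "Phi-3 Vision-128K-Instruct", "LlaVA-1.6 Vicuna-7B", "QWEN-VL Chat",
    "Llama3-Llava-Next-8B", "Claude-3 Haiku", "Gemini 1.0 Pro V", "GPT-4V-Turbo",
]
SCORES = {
    "MMMU":      [40.4, 34.2, 39.0, 36.4, 40.7, 42.0, 55.5],
    "MMBench":   [80.5, 76.3, 75.8, 79.4, 62.4, 80.0, 86.1],
    "ScienceQA": [90.8, 70.6, 67.2, 73.7, 72.0, 79.7, 75.7],
    "MathVista": [44.5, 31.5, 29.4, 34.8, 33.2, 35.0, 47.5],
    "InterGPS":  [38.1, 20.5, 22.3, 24.6, 32.1, 28.6, 41.0],
    "AI2D":      [76.7, 63.1, 59.8, 66.9, 60.3, 62.8, 74.7],
    "ChartQA":   [81.4, 55.0, 50.9, 65.8, 59.3, 58.0, 62.3],
    "TextVQA":   [70.9, 64.6, 59.4, 55.7, 62.7, 64.7, 68.1],
    "POPE":      [85.8, 87.2, 82.6, 87.0, 74.4, 84.2, 83.7],
}


def leaders(scores, models):
    """Map each benchmark to the name of its highest-scoring model."""
    result = {}
    for benchmark, row in scores.items():
        best_index = max(range(len(row)), key=row.__getitem__)
        result[benchmark] = models[best_index]
    return result


# Benchmarks on which the Phi-3 Vision column holds the top score.
phi_wins = [b for b, m in leaders(SCORES, MODELS).items() if m == MODELS[0]]
print(phi_wins)  # ['ScienceQA', 'AI2D', 'ChartQA', 'TextVQA']
```

On these figures, Phi-3 Vision has the top score on four of the nine benchmarks, GPT-4V-Turbo on four, and LlaVA-1.6 Vicuna-7B on POPE.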

Intended Uses

Primary use cases

The model is intended for broad commercial and research use in English. It targets general-purpose AI systems and applications with visual and text input capabilities that require

  1. memory/compute constrained environments;

  2. latency bound scenarios;

  3. general image understanding;

  4. OCR;

  5. chart and table understanding.

The model is designed to accelerate research on efficient language and multimodal models, for use as a building block for generative AI powered features.

Use case considerations

The model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.

Responsible AI Considerations

Like other models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

  • Quality of Service: The Phi models are trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English.

  • Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.

  • Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case.

  • Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.

  • Limited Scope for Code: The majority of Phi-3 training data is based on Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses.

Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Important areas for consideration include:

  • Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques.

  • High-Risk Scenarios: Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.

  • Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).

  • Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.

  • Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.

  • Identification of individuals: Models with vision capabilities may have the potential to uniquely identify individuals in images. Safety post-training steers the model to refuse such requests, but developers should consider and implement, as appropriate, additional mitigations or user consent flows as required in their respective jurisdiction (e.g., building measures to blur faces in image inputs before processing).
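The Misinformation item above mentions grounding responses in use-case specific context (RAG). The sketch below shows the general shape of that idea under simplifying assumptions: retrieval is a plain keyword-overlap ranking standing in for a real search or embedding index, and the function names, prompt wording, and sample documents are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Rank documents by shared tokens with the question (a stand-in for real retrieval)."""
    question_tokens = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_grounded_prompt(question, documents, top_k=1):
    """Prepend the retrieved context so the model answers from it rather than from memory."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, top_k))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Hypothetical knowledge base for illustration only.
docs = [
    "The return window for online orders is 30 days.",
    "Support is available by email on weekdays.",
]
prompt = build_grounded_prompt("How many days is the return window?", docs)
print(prompt)
```

A production system would typically replace `retrieve` with a vector or full-text search service and would still need output checks; the sketch only illustrates where the retrieved context enters the prompt.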

Inference Samples

| Inference type | Python sample (Notebook) | CLI with YAML |
|---|---|---|
| Real time | image-text-to-text-generation-online-endpoint.ipynb | image-text-to-text-generation-online-endpoint.sh |

Sample inputs and outputs (for real-time inference)

The Phi-3-vision model supports only a single image per conversation, as summarized in the grid below:

| | Single-turn | Multi-turn conversation |
|---|---|---|
| Single Image | Yes | Yes |
| Multiple Images | No | No |
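Because multiple images are not supported, a client may want to reject such conversations before sending them. The following is a minimal sketch, assuming messages use the role/content shape of the sample input in this section; the helper names `count_images` and `check_single_image` are hypothetical.

```python
def count_images(messages):
    """Count parts of type "image_url" across every message in a conversation."""
    total = 0
    for message in messages:
        content = message.get("content")
        # Plain-string content carries no image parts; only lists are inspected.
        if isinstance(content, list):
            total += sum(
                1
                for part in content
                if isinstance(part, dict) and part.get("type") == "image_url"
            )
    return total


def check_single_image(messages):
    """Raise ValueError if the conversation holds more than one image."""
    n = count_images(messages)
    if n > 1:
        raise ValueError(f"conversation contains {n} images; only one is supported")
    return n


# Illustrative conversation with a placeholder image URL.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
    {"role": "assistant", "content": "A street scene."},
    {"role": "user", "content": "What color is the sign?"},
]
print(check_single_image(conversation))  # 1
```

The check runs over the whole conversation rather than the latest turn, since the limit applies per conversation in both the single-turn and multi-turn cases.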

Sample Input

{
  "input_data": {
    "input_string": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://www.ilankelman.org/stopsigns/australia.jpg"
            }
          },
          {
            "type": "text",
            "text": "What is shown in this image? Be extremely detailed and specific."
          }
        ]
      }
    ],
    "parameters": { "temperature": 0.7, "max_new_tokens": 2048 }
  }
}

Sample Output

{
  "output": " The image captures a vibrant street scene. Dominating the left side of the image is a red stop sign, standing on a white pole. Adjacent to the stop sign, a white lion statue adds a touch of symbolism to the scene. \n\nThe background is filled with colorful buildings, including a red one and a yellow one, adding a lively atmosphere to the scene. The blue sky overhead and a clear white road underneath it complete the picture. \n\nAdding to the cultural context, there are Chinese characters visible in the background, suggesting the presence of a Chinese influence in this location. The overall scene is a blend of urban life and cultural elements."
}
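The request and response shapes above can be produced and consumed with a few lines of Python. This is a sketch only: `build_payload` and `extract_output` are hypothetical helper names, the defaults mirror the sample's parameters, and the HTTP call itself is omitted because the endpoint URL and authentication depend on the deployment.

```python
import json


def build_payload(image_url, prompt, temperature=0.7, max_new_tokens=2048):
    """Build a single-turn, single-image request body in the sample input's shape."""
    return {
        "input_data": {
            "input_string": [
                {
                    "role": "user",
                    "content": [
                        {"type": "image_url", "image_url": {"url": image_url}},
                        {"type": "text", "text": prompt},
                    ],
                }
            ],
            "parameters": {
                "temperature": temperature,
                "max_new_tokens": max_new_tokens,
            },
        }
    }


def extract_output(response_body):
    """Parse a JSON response body and return the "output" text without surrounding whitespace."""
    return json.loads(response_body)["output"].strip()


payload = build_payload(
    "https://www.ilankelman.org/stopsigns/australia.jpg",
    "What is shown in this image? Be extremely detailed and specific.",
)
# Serialized body, ready to send to a deployed real-time endpoint.
body = json.dumps(payload)
```

The sample output begins with a leading space, which is why `extract_output` strips whitespace before returning the text.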

Software

Hardware

Note that by default, the Phi-3-Vision-128K model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:

  • NVIDIA A100

  • NVIDIA A6000

  • NVIDIA H100

License

The model is licensed under the MIT license.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.

public static readonly FoundryModel Phi3Vision128kInstruct
Phi-4 14B, a highly capable model for low latency scenarios.
public static readonly FoundryModel Phi4
3.8B parameters Small Language Model outperforming larger models in reasoning, math, coding, and function-calling
public static readonly FoundryModel Phi4MiniInstruct
Phi4MiniReasoning Section titled Phi4MiniReasoning staticreadonly FoundryModel
Lightweight math reasoning model optimized for multi-step problem solving
public static readonly FoundryModel Phi4MiniReasoning
Phi4MultimodalInstruct Section titled Phi4MultimodalInstruct staticreadonly FoundryModel
First small multimodal model to have 3 modality inputs (text, audio, image), excelling in quality and efficiency
public static readonly FoundryModel Phi4MultimodalInstruct
State-of-the-art open-weight reasoning model.
public static readonly FoundryModel Phi4Reasoning
Azure Language Text PII service.
public static readonly FoundryModel TextPii