Institution

Microsoft Research

Microsoft's research division, contributing foundational work from ResNet to the Phi small-model family.

Arbor: Autonomous Research With Hypothesis Trees

Arbor stores research attempts in a persistent hypothesis tree, then admits changes only through held-out evaluation. It reports best held-out results on six AO tasks and 86.36% Any Medal on MLE-Bench Lite.

Text Embeddings · Microsoft Research

E5: Weakly-Supervised Contrastive Text Embeddings

E5 trains general-purpose text embeddings with weakly supervised contrastive learning. This explainer covers the supporting evidence, the method tradeoffs, and the limits for practical use.

AI for Science · Microsoft Research

MatterGen Explained: Diffusion for Inverse Materials Design

MatterGen is a diffusion model that generates inorganic crystals matching a target property. The one candidate it actually synthesized, TaCr2O6, came within 20% of its 200 GPa stiffness goal.

Speech Synthesis · Microsoft Research

NaturalSpeech 2: Diffusion TTS Beyond Codec LMs

NaturalSpeech 2 uses latent diffusion over neural-audio-codec vectors and scales to 44K hours of speech and singing, aiming for stronger zero-shot prosody than token LMs.

Speech Synthesis · Microsoft Research

VALL-E: Zero-Shot Voice Cloning with Audio Tokens

VALL-E reframes TTS as codec-token language modeling: trained on 60K hours of speech, it produces personalized zero-shot speech from a 3-second prompt, though safety and release constraints limit how it can be used.

World Models · Microsoft Research

Mirage: Latent Spatial Memory Makes Video World Models 10x Faster

Mirage stores a video world model's 3D memory inside diffusion latent space instead of an RGB point cloud, hitting state-of-the-art WorldScore (70.36) while running 10.57x faster and using 55x less GPU memory.

Text-to-Image · Microsoft Research

Lens: A 3.8B Text-to-Image Model Trained on ~19% of Z-Image's Compute

Microsoft's Lens is a 3.8B-parameter text-to-image diffusion model that matches 6B+ rivals while using about 19.3% of Z-Image's training compute, mostly by feeding it longer, denser captions.

Multimodal Models · Microsoft Research

LLaVA Explained: Visual Instruction Tuning for a Vision-Language Chat Model

LLaVA connects a CLIP vision encoder to a Vicuna LLM through a single linear projection, then trains on GPT-4-generated image instructions, reaching 85.1% of GPT-4's score and 92.53% on ScienceQA.

Efficient AI · Microsoft Research

LoRA Explained: Low-Rank Adaptation for Fine-Tuning LLMs

LoRA freezes a pretrained model and trains small low-rank matrices in each layer instead, cutting trainable parameters by up to 10,000x and GPU memory by 3x versus full fine-tuning of GPT-3 175B, with no added inference latency.
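
As a rough illustration of the low-rank idea, the sketch below adds a trainable product B·A to a frozen weight W and counts trainable parameters for a single matrix. The helper names (`matvec`, `lora_forward`, `trainable_params`), the toy values, and the 4096 x 4096 rank-8 example are assumptions chosen for illustration, not GPT-3's configuration, so the reduction printed here is far smaller than the 10,000x figure quoted for GPT-3 175B.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x):
    """Compute W x + B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

def trainable_params(d_in, d_out, rank):
    """Return (full fine-tuning, LoRA) trainable counts for one d_out x d_in matrix."""
    return d_in * d_out, rank * (d_in + d_out)

# Toy example (hypothetical values): a 2x3 frozen weight with rank-1 adapters.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]   # rank x d_in
B = [[0.0], [0.0]]       # d_out x rank, zero at the start of training
x = [1.0, 2.0, 3.0]

# With B at zero, the adapted layer reproduces the frozen layer exactly.
assert lora_forward(W, A, B, x) == matvec(W, x)

full, lora = trainable_params(4096, 4096, 8)
print(full, lora, full // lora)  # 16777216 65536 256
```

Because the update is just a matrix product, B·A can be added into W once training is done, which is consistent with the no-added-latency claim above.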

Efficient AI · Microsoft Research

Phi-3-mini: A 3.8B Model That Rivals GPT-3.5 on Your Phone

Phi-3-mini is a 3.8B-parameter model trained on 3.3T heavily filtered and synthetic tokens. It scores 69% on MMLU and 8.38 on MT-bench, rivaling Mixtral 8x7B and GPT-3.5 while staying small enough to run on a phone.

Vision Foundation Models · Microsoft Research

ResNet Explained: Deep Residual Learning for Image Recognition

ResNet adds skip connections so a layer learns a residual instead of a full mapping, making 152-layer networks trainable. An ensemble hit 3.57% top-5 error on ImageNet and won ILSVRC 2015.
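
A minimal sketch of the skip-connection idea, assuming a toy vector input in place of the convolutional layers and batch normalization used in the actual networks. The names `residual_block` and `zero_residual` are hypothetical. It shows why stacking is benign: when the learned residual F(x) is zero, each block passes its input through unchanged.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def residual_block(x, residual_fn):
    """Return relu(F(x) + x); the skip connection adds the input back."""
    fx = residual_fn(x)
    return relu([f + xi for f, xi in zip(fx, x)])

def zero_residual(x):
    """Stand-in for a block whose learned residual is exactly zero."""
    return [0.0 for _ in x]

x = [1.0, 2.0, 3.0]
out = x
for _ in range(10):  # a stack of ten toy blocks, not the real architecture
    out = residual_block(out, zero_residual)

print(out)  # [1.0, 2.0, 3.0]
```

A plain stacked layer would have to learn the identity mapping explicitly to achieve the same pass-through, which is the contrast the residual formulation is meant to draw.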

AI Agents · Microsoft Research

SkillOpt: Training a Frozen Agent's Skill Text Like a Model

SkillOpt trains a single skill document for a frozen LLM agent using bounded add/delete/replace edits and a held-out gate, lifting GPT-5.5 by 23.5 points in direct chat across six benchmarks.