
BIOMEDICA INDEX
An Index for the Open Biomedical Image-Caption Archive
Dataset access
Access the data via HF Datasets 🤗
Contact us
Contact us with questions and suggestions!
Colab Tutorial
Learn how to use the index in Colab!
Despite the excitement surrounding biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data –
the foundation of modern AI systems – remains a bottleneck to unlocking its full potential. To address this gap, we previously introduced
BIOMEDICA
an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million
scientific articles and 24 million image-text pairs, along with 27 metadata fields, including expert human annotations. To
overcome the challenges of accessing our large-scale dataset, we extend our tools to enable
on-demand content retrieval and streaming, facilitating seamless integration with
AI systems. We further demonstrate the utility of the BIOMEDICA dataset and index by building embedding models, chat-style models, and
retrieval-augmented chat agents. Notably, all our AI models surpass previous open systems in their respective categories,
underscoring the critical role of diverse, high-quality, and large-scale biomedical data.

Figure 1: Overlap of the BIOMEDICA dataset with the Landscape
of Biomedical Research: Overview of the BIOMEDICA dataset, tools for accessibility, and its applications. (A) The dataset comprises 6 million open-
access articles, 24 million image-caption pairs, and 30 million in-line references, spanning diverse biomedical domains such as clinical
radiology and pathology images, research microscopy, immunoassays, and chemical structures, among other scientific images.
(B) To facilitate
AI model development and inference, we offer data streaming, filtering, and the BIOMEDICA Index. Streaming enables efficient training
without the need for extensive local storage. Data filtering allows users to create domain-specific subsets of the data. The BIOMEDICA
Index supports multi-modal retrieval-based applications. (C) The BIOMEDICA dataset enables diverse biomedical applications, including
chat models, embedding models, and agentic systems.
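The streaming and filtering workflow described in (B) can be sketched as follows. This is a minimal illustration, not the dataset's actual interface: the Hugging Face dataset id in the comment is a placeholder, and the metadata field names (`panel_type`) are hypothetical.

```python
from typing import Callable, Iterable, Iterator

def stream_filter(stream: Iterable[dict], keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Lazily filter a streamed dataset: records are inspected one at a
    time, so the full archive is never materialized on local disk."""
    for record in stream:
        if keep(record):
            yield record

# With Hugging Face Datasets, the stream would come from something like
#   datasets.load_dataset("<biomedica-dataset-id>", split="train", streaming=True)
# Here we use a toy in-memory stream with hypothetical metadata fields.
toy_stream = [
    {"caption": "H&E-stained tissue section", "panel_type": "microscopy"},
    {"caption": "Bar chart of survival rates", "panel_type": "plot"},
    {"caption": "Chest radiograph, PA view", "panel_type": "clinical imaging"},
]

# Keep only biomedical imagery, dropping plot panels.
biomedical_only = list(stream_filter(toy_stream, keep=lambda r: r["panel_type"] != "plot"))
```

The same predicate-based pattern yields domain-specific subsets (e.g., pathology-only) by swapping the `keep` function.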
Experiments
The scale and comprehensive content of BIOMEDICA establish it as a foundational resource for developing a wide range of AI systems. To demonstrate its utility, we create three demonstrator applications: an embedding model, a chat model, and a retrieval-augmented chat system. Additionally, we curate widely-used multimodal biomedical benchmarks to evaluate and compare the performance of our embedding and chat models against previous work.
Figure 3: BIOMEDICA enables state-of-the-art performance across multiple applications. (A) Multimodal embedding model performance
on biomedical image classification tasks. (B) Autoregressive model performance on biomedical VQA tasks (results for other previous
models are obtained from [31]). (C) Autoregressive model performance on biomedical guidelines QA across four LLMs with and without
retrieval augmentation using the BIOMEDICA Index.
Embedding models trained on the BIOMEDICA dataset lead to better multimodal representations. By learning shared representations across diverse data types, joint vision-language modeling through contrastive learning has emerged as a powerful approach to improve image classification and multimodal retrieval. Within biomedical applications, integrating multiple imaging modalities enables a more comprehensive and holistic view of diseases and patient states. We leverage annotations and metadata from the BIOMEDICA dataset to filter out non-biomedical images (e.g., plots and tables) and pretrain a contrastive vision-language model, BMC-CLIP. Our systematic evaluations across 41 datasets show that BMC-CLIP outperforms prior work (Figure 3A).
Vision-language autoregressive models trained on the BIOMEDICA dataset achieve competitive performance at a fraction of the cost. By leveraging text as an interface, vision-language autoregressive models enable intuitive, barrier-free AI interaction. Through natural language queries, these chat-like systems can perform tasks such as visual question answering, image captioning, object detection, and referring segmentation – key tasks that translate directly to clinical and biological applications. To this end, we use a small subset of the BIOMEDICA dataset to create a 2M-sample alignment dataset and collect the training sets for VQA-RAD, SLAKE, PathVQA, and PMC-VQA to add instruction-following data. We then fine-tune SmolVLM, a 2.2B parameter model, to create BMC-SmolVLM and evaluate our system on four visual question-answering tasks. With only 2.2B parameters, our system achieves similar or better performance than previous small models (under 15B parameters), even outperforming some larger models (Figure 3B).
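Instruction-following data of the kind described above is typically serialized into a chat template before fine-tuning. The template below mirrors common VLM chat conventions and is an illustrative assumption, not the exact format used for BMC-SmolVLM; the image is referenced by a placeholder part and substituted by the training pipeline.

```python
def to_chat_example(caption: str,
                    question: str = "Describe the key findings in this figure.") -> list[dict]:
    """Turn one image-caption pair into a chat-style supervised example:
    the user turn carries the image plus a question, and the assistant
    turn carries the caption as the target response."""
    return [
        {"role": "user",
         "content": [{"type": "image"},
                     {"type": "text", "text": question}]},
        {"role": "assistant",
         "content": [{"type": "text", "text": caption}]},
    ]

example = to_chat_example("Axial CT showing a hypodense hepatic lesion.")
```

Mapping this function over filtered image-caption pairs yields an alignment set, while the VQA training sets slot into the same structure with their own questions and answers.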
The BIOMEDICA Index enables AI agentic systems to answer medical guideline-derived questions. Agentic systems can assist medical practitioners in their daily workflows, from virtual tumor boards to virtual case conferences. To this end, we use our previously developed embedding models and the BIOMEDICA Index to create a retrieval-augmented AI agent (BMC-agent). Our system is the first of its kind, capable of retrieving similar images, captions, and full-text articles using image queries, text queries, or both across the entire dataset, with an average latency of 123.89 ms ± 4.07 ms (Figure 4). Furthermore, BMC-agent allows easy integration of any public or closed VLM/LLM. To evaluate BMC-agent, we curated a clinician-verified dataset of 50 questions derived from neurology, molecular pathology, and pharmacogenomics guidelines published up to January 2023 (before the GPT-4o knowledge cutoff in April 2023). We test our AI agent using four different LLMs/VLMs: DeepSeek-R1 (Llama-based, 70B), Qwen2-VL (72B), GPT-4o, and Llama-3.3 (70B). Our evaluations show that incorporating the BIOMEDICA Index, which allows agents to search and synthesize full-text articles with relevant information, improves performance by an average of 36.22% across all evaluated models (Figure 3C).
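The retrieval step of such an agent reduces to nearest-neighbor search over dense embeddings followed by prompt assembly. The sketch below uses brute-force cosine similarity in NumPy, whereas a production index would use an approximate-nearest-neighbor backend; the prompt template is an assumption, not BMC-agent's actual format.

```python
import numpy as np

def retrieve(query_emb: np.ndarray, index_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k entries most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    m = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity to every entry
    return np.argsort(-sims)[:k]      # top-k, highest similarity first

def build_prompt(question: str, captions: list[str]) -> str:
    """Assemble a retrieval-augmented prompt for a downstream LLM/VLM."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(captions))
    return (f"Context:\n{context}\n\n"
            f"Question: {question}\n"
            f"Answer using only the context above.")
```

Because any embedding model can produce `query_emb` and any LLM/VLM can consume the prompt, this design is what lets the agent swap in public or closed models freely.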

Figure 4: Query time as a function of the number of words (tokens)
in the BIOMEDICA index. The blue solid line represents the mean
query time, while the shaded blue region indicates the standard
error. The dashed gray line denotes the linear trend in query time
as token count increases (R = 0.902).