LLM2Vec4CXR - Fine-tuned Model for Chest X-ray Report Analysis
LLM2Vec4CXR is a text encoder optimized for chest X-ray report analysis and medical text understanding.
It is introduced in our paper Exploring the Capabilities of LLM Encoders for Image–Text Retrieval in Chest X-rays.
Model Description
LLM2Vec4CXR is a bidirectional text encoder fine-tuned with a latent_attention pooling strategy.
This design enhances semantic representation of chest X-ray reports, making the model robust across different reporting styles and effective even with domain-specific abbreviations.
It improves performance on clinical text similarity, retrieval, and interpretation tasks.
Key Features
- Base Architecture: LLM2CLIP-Llama-3.2-1B-Instruct
- Pooling Mode: Latent Attention (trained weights automatically loaded)
- Bidirectional Processing: Enabled for better context understanding
- Medical Domain: Specialized for chest X-ray report analysis
- Max Length: 512 tokens
- Precision: bfloat16
- Automatic Loading: Latent attention weights are automatically loaded from safetensors
- Simple API: Built-in methods for similarity computation and instruction-based encoding
Training Details
Training Data
- Fully fine-tuned on chest X-ray reports and medical text data
- Training focused on understanding pleural effusion status and other chest X-ray findings
Training Configuration
- Pooling Mode: latent_attention (modified from base model)
- Enable Bidirectional: True
- Max Length: 512
- Torch Dtype: bfloat16
- Full Fine-tuning: All model weights were updated during training
Usage
Installation
# Only transformers and torch are needed
pip install transformers torch
Basic Usage
import torch
from transformers import AutoModel
# Load the model - that's it!
model = AutoModel.from_pretrained(
    "lukeingawesome/llm2vec4cxr",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu").eval()
# Simple text encoding
report = "Small left pleural effusion with basal atelectasis."
embedding = model.encode_text([report])
print(embedding.shape) # torch.Size([1, 2048])
# Multiple texts at once
reports = [
    "No acute cardiopulmonary abnormality.",
    "Small bilateral pleural effusions.",
    "Large left pleural effusion with compressive atelectasis."
]
embeddings = model.encode_text(reports)
print(embeddings.shape) # torch.Size([3, 2048])
Instruction-Based Encoding and Similarity
import torch
from transformers import AutoModel
# Load model
model = AutoModel.from_pretrained(
    "lukeingawesome/llm2vec4cxr",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu").eval()
# Instruction-based task with separator
instruction = "Determine the status of the pleural effusion."
report = "There is a small increase in the left-sided effusion."
query = instruction + "!@#$%^&*()" + report
# Compare against multiple candidates
candidates = [
    "No pleural effusion",
    "Pleural effusion present",
    "Worsening pleural effusion",
    "Improving pleural effusion"
]
# One-line similarity computation
scores = model.compute_similarities(query, candidates)
print(scores)
# tensor([0.7171, 0.8270, 0.9155, 0.8113], device='cuda:0')
best_match = candidates[torch.argmax(scores)]
print(f"Best match: {best_match}")
# Best match: Worsening pleural effusion
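The query above is simply the instruction, the separator string, and the report joined together. A small helper can keep that convention in one place. This is only a sketch: `build_query` and `split_query` are hypothetical names, not part of the model's API, and the separator is assumed to be the default shown in this README.

```python
# Hypothetical helpers (not part of the model's API) for the
# instruction + separator + report convention used above.
SEPARATOR = "!@#$%^&*()"  # default separator shown in this README


def build_query(instruction: str, report: str, separator: str = SEPARATOR) -> str:
    """Join an instruction and a report into a single query string."""
    return f"{instruction}{separator}{report}"


def split_query(query: str, separator: str = SEPARATOR) -> tuple[str, str]:
    """Split a query back into (instruction, report).

    Text without a separator is treated as a report with an empty instruction.
    """
    instruction, found, report = query.partition(separator)
    if not found:
        return "", query
    return instruction, report


query = build_query(
    "Determine the status of the pleural effusion.",
    "There is a small increase in the left-sided effusion.",
)
print(split_query(query))
```

The resulting `query` string can be passed to `compute_similarities` exactly as in the example above.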
Medical Report Retrieval Example
import torch
from transformers import AutoModel
# Load model
model = AutoModel.from_pretrained(
    "lukeingawesome/llm2vec4cxr",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu").eval()
# Instruction for retrieval
instruction = "Retrieve semantically similar reports"
query_report = "Small left pleural effusion with basal atelectasis."
query = instruction + "!@#$%^&*()" + query_report
# Candidate reports
candidates = [
    "No acute cardiopulmonary abnormality.",
    "Small left pleural effusion is present.",
    "Large right pleural effusion causing compressive atelectasis.",
    "Heart size is normal with no evidence of pleural effusion.",
]
# Compute similarities
scores = model.compute_similarities(query, candidates)
# Get most similar
best_idx = torch.argmax(scores)
print(f"Most similar: {candidates[best_idx]}")
print(f"Score: {scores[best_idx]:.4f}")
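Retrieval often needs more than the single best match. The sketch below ranks candidates by score and keeps the top k. `top_k` is a hypothetical helper, and the scores are the plain-float values printed in the pleural effusion example earlier in this README, so the snippet runs without loading the model.

```python
# Hypothetical helper: rank candidates by similarity score.
def top_k(candidates, scores, k=3):
    """Return the k (candidate, score) pairs with the highest scores."""
    pairs = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return pairs[:k]


# Candidates and scores copied from the pleural effusion example above.
candidates = [
    "No pleural effusion",
    "Pleural effusion present",
    "Worsening pleural effusion",
    "Improving pleural effusion",
]
# With a live model, a tensor can be converted with scores.float().tolist().
scores = [0.7171, 0.8270, 0.9155, 0.8113]

for text, score in top_k(candidates, scores, k=2):
    print(f"{score:.4f}  {text}")
```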
API Reference
The model provides three main methods:
encode_text(texts, max_length=512)
Simple text encoding for one or more texts.
Parameters:
- texts: List of strings or a single string
- max_length: Maximum sequence length (default: 512)
Returns: Tensor of shape (batch_size, 2048)
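When encoding many reports, it may help to feed `encode_text` in fixed-size batches. The following is a minimal sketch under that assumption; `as_batch` and `chunks` are hypothetical helpers, the batch size of 16 is arbitrary, and the model call is left as a comment so the snippet runs on its own.

```python
from typing import Iterator, List, Union


def as_batch(texts: Union[str, List[str]]) -> List[str]:
    """Wrap a single string in a list; copy lists through unchanged."""
    return [texts] if isinstance(texts, str) else list(texts)


def chunks(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


reports = as_batch("Small left pleural effusion with basal atelectasis.")
for batch in chunks(reports, batch_size=16):
    # With the model loaded: embeddings = model.encode_text(batch)
    print(len(batch))
```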
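Instruction-based encoding
Encodes texts that may combine an instruction and content joined by a separator.
Parameters:
- texts: List of strings, each optionally containing the separator
- separator: String separator (default: '!@#$%^&*()')
- max_length: Maximum sequence length (default: 512)
Returns: Tensor of shape (batch_size, 2048)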
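compute_similarities(query_text, candidate_texts, separator='!@#$%^&*()', max_length=512)
Computes the similarity between one query and a list of candidates.
Parameters:
- query_text: Single query string
- candidate_texts: List of candidate strings
- separator: String separator (default: '!@#$%^&*()')
- max_length: Maximum sequence length (default: 512)
Returns: Tensor of shape (num_candidates,) with cosine similarity scores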
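For readers unfamiliar with the metric, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below shows the standard definition in plain Python on toy vectors. It is not the model's internal implementation, which is not shown in this README.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors; real embeddings have 2048 dimensions.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
```

Because the score depends only on direction, scaling an embedding does not change its similarity to other embeddings.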
Training Details
Training Data
- Fully fine-tuned on chest X-ray reports and medical text data
- Training focused on understanding pleural effusion status and other chest X-ray findings
Training Configuration
- Pooling Mode: latent_attention (512 latents, 8 attention heads)
- Enable Bidirectional: True
- Max Length: 512 tokens
- Torch Dtype: bfloat16
- Full Fine-tuning: All model weights were updated during training
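The configuration above names latent attention pooling with 512 latents and 8 heads. As a rough illustration of the general idea (assumed here, not confirmed to match this model's actual layer), token hidden states act as queries over a set of learned latent vectors, and the attended outputs are mean-pooled into one vector. The sketch is deliberately simplified: single head, no learned projections or MLP, toy dimensions, and `latent_attention_pool` is a hypothetical name.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def latent_attention_pool(hidden, latents):
    """Simplified single-head latent attention followed by mean pooling.

    hidden:  seq_len x d token states (used as queries)
    latents: n_latents x d vectors (used as both keys and values)
    """
    d = len(latents[0])
    scale = 1.0 / math.sqrt(d)
    attended = []
    for token in hidden:
        # Attention weights of this token over every latent vector.
        logits = [scale * sum(t * l for t, l in zip(token, lat)) for lat in latents]
        weights = softmax(logits)
        # Weighted sum of the latent vectors.
        attended.append(
            [sum(w * lat[j] for w, lat in zip(weights, latents)) for j in range(d)]
        )
    # Mean-pool over tokens to get one fixed-size vector.
    return [sum(row[j] for row in attended) / len(attended) for j in range(d)]


hidden = [[0.5, 0.1], [0.3, 0.2], [0.0, 1.0]]  # 3 tokens, d = 2 (toy values)
latents = [[1.0, 0.0], [0.0, 1.0]]             # 2 latents (the model lists 512)
print(latent_attention_pool(hidden, latents))
```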
Technical Specifications
- Model Type: Bidirectional Language Model (LLM2Vec)
- Architecture: LlamaBiModel (modified Llama 3.2) + Latent Attention Pooling
- Parameters: ~1B parameters
- Hidden Size: 2048
- Input Length: Up to 512 tokens
- Output Dimension: 2048
- Precision: bfloat16
- Dependencies: Only transformers and torch
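The specifications above imply a simple storage estimate for a retrieval index: each embedding has 2048 values, and bfloat16 uses 2 bytes per value. The arithmetic can be sketched as follows, assuming embeddings are stored in bfloat16 without compression; `embedding_bytes` is a hypothetical helper.

```python
HIDDEN_SIZE = 2048   # output dimension listed above
BYTES_PER_BF16 = 2   # bfloat16 stores each value in 16 bits


def embedding_bytes(num_reports: int) -> int:
    """Raw storage needed for num_reports bfloat16 embeddings."""
    return num_reports * HIDDEN_SIZE * BYTES_PER_BF16


print(embedding_bytes(1))                # bytes for one report embedding
print(embedding_bytes(1_000_000) / 1e9)  # gigabytes for one million reports
```

One embedding takes 4096 bytes, so an index of one million reports needs roughly 4.1 GB before any index overhead.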
Intended Use
Primary Use Cases
- Medical Text Embeddings: Generate embeddings for chest X-ray reports
- Clinical Text Similarity: Compare medical texts for semantic similarity
- Medical Information Retrieval: Find relevant medical reports or findings
- Clinical NLP Research: Foundation model for medical text analysis
Limitations
- Specialized for chest X-ray reports - may not generalize to other medical domains
- Requires careful preprocessing for optimal performance
- Should be used as part of a larger clinical decision support system, not for standalone diagnosis
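This README does not say which preprocessing steps are meant. As one assumption, reports exported from clinical systems often contain hard line breaks and repeated spaces, so a minimal cleanup might only normalize whitespace. `normalize_report` is a hypothetical helper, not a documented requirement of the model.

```python
import re


def normalize_report(text: str) -> str:
    """Collapse runs of whitespace (including newlines) and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


raw = "  Small left\n pleural   effusion with\tbasal atelectasis. "
print(normalize_report(raw))
```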
Evaluation
The model has been evaluated on chest X-ray report analysis tasks, particularly for:
- Text retrieval and encoding
- Medical text similarity comparison
- Clinical finding extraction
Sample Performance
The model demonstrates consistent improvements over the base LLM2CLIP architecture on medical text understanding benchmarks.
LLM2Vec4CXR shows stronger performance in:
- Handling medical abbreviations and radiological terminology
- Capturing fine-grained semantic differences in chest X-ray reports
- Understanding clinical context and temporal changes
Related Resources
Paper: Exploring the Capabilities of LLM Encoders for Image–Text Retrieval in Chest X-rays
Related Projects:
- LLM2CLIP4CXR: A CLIP-based model that leverages the LLM2Vec encoder to align visual and textual representations of chest X-rays
Citation
If you use this model in your research, please cite:
@article{ko2025exploring,
title={Exploring the Capabilities of LLM Encoders for Image--Text Retrieval in Chest X-rays},
author={Ko, Hanbin and Cho, Gihun and Baek, Inhyeok and Kim, Donguk and Koo, Joonbeom and Kim, Changi and Lee, Dongheon and Park, Chang Min},
journal={arXiv preprint arXiv:2509.15234},
year={2025}
}
Acknowledgments
This model is built upon:
- LLM2Vec - Framework for converting decoder-only LLMs into text encoders
- LLM2CLIP - Microsoft's implementation for connecting LLMs with CLIP models
License
This model is licensed under the MIT License.