LLM2Vec4CXR - Fine-tuned Model for Chest X-ray Report Analysis

LLM2Vec4CXR is a text encoder optimized for chest X-ray report analysis and medical text understanding.
It was introduced in our paper "Exploring the Capabilities of LLM Encoders for Image–Text Retrieval in Chest X-rays" (arXiv:2509.15234).

Model Description

LLM2Vec4CXR is a bidirectional text encoder fine-tuned with latent attention pooling (pooling mode latent_attention).
Bidirectional attention and the learned pooling layer are intended to give richer semantic representations of chest X-ray reports, so that embeddings stay robust across reporting styles and domain-specific abbreviations.
Relative to the base LLM2CLIP encoder, it improves performance on clinical text similarity, retrieval, and interpretation tasks.

Key Features

Base Architecture: LLM2CLIP-Llama-3.2-1B-Instruct (microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned)
Pooling Mode: Latent Attention (trained weights automatically loaded)
Bidirectional Processing: Enabled for better context understanding
Medical Domain: Specialized for chest X-ray report analysis
Max Length: 512 tokens
Precision: bfloat16
Automatic Loading: Latent attention weights are automatically loaded from safetensors
Simple API: Built-in methods for similarity computation and instruction-based encoding
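
The "Bidirectional Processing" feature refers to letting every token attend to tokens on both sides, rather than only to earlier tokens as in a decoder-only Llama. The pure-Python sketch below illustrates the two attention-mask shapes; the helper names are hypothetical and are not part of the model's API.

```python
def causal_mask(n):
    """Row i may attend to column j only when j <= i (decoder-style)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may attend to every other token (encoder-style)."""
    return [[1] * n for _ in range(n)]


if __name__ == "__main__":
    for name, mask in (("causal", causal_mask(4)),
                       ("bidirectional", bidirectional_mask(4))):
        print(name)
        for row in mask:
            print(" ".join(str(v) for v in row))
```

With the bidirectional mask, a late token such as "effusion" can influence the representation of an earlier modifier such as "small", which a causal mask does not allow.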

Usage

Installation

# Only transformers and torch are needed
pip install transformers torch

Basic Usage

import torch
from transformers import AutoModel

# Load the model - that's it!
model = AutoModel.from_pretrained(
    "lukeingawesome/llm2vec4cxr",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu").eval()

# Simple text encoding
report = "Small left pleural effusion with basal atelectasis."
embedding = model.encode_text([report])
print(embedding.shape)  # torch.Size([1, 2048])

# Multiple texts at once
reports = [
    "No acute cardiopulmonary abnormality.",
    "Small bilateral pleural effusions.",
    "Large left pleural effusion with compressive atelectasis."
]
embeddings = model.encode_text(reports)
print(embeddings.shape)  # torch.Size([3, 2048])

Instruction-Based Encoding and Similarity

import torch
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "lukeingawesome/llm2vec4cxr",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu").eval()

# Instruction-based task with separator
instruction = "Determine the status of the pleural effusion."
report = "There is a small increase in the left-sided effusion."
query = instruction + "!@#$%^&*()" + report

# Compare against multiple candidates
candidates = [
    "No pleural effusion",
    "Pleural effusion present",
    "Worsening pleural effusion",
    "Improving pleural effusion"
]

# One-line similarity computation
scores = model.compute_similarities(query, candidates)
print(scores)
# tensor([0.7171, 0.8270, 0.9155, 0.8113], device='cuda:0')

best_match = candidates[torch.argmax(scores).item()]  # .item() gives a plain int index
print(f"Best match: {best_match}")
# Best match: Worsening pleural effusion
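
The separator string marks where the instruction ends and the report begins. The sketch below shows one way to build and split such queries. `build_query` and `split_query` are hypothetical helpers written for illustration, not methods of the model, and how the model treats the two parts internally is not shown here.

```python
SEPARATOR = "!@#$%^&*()"  # separator string used in the examples above


def build_query(instruction, text, separator=SEPARATOR):
    """Join an instruction and a report into one separator-delimited query."""
    if separator in instruction:
        raise ValueError("instruction must not contain the separator")
    return instruction + separator + text


def split_query(query, separator=SEPARATOR):
    """Split a query into (instruction, text); instruction is '' if absent."""
    instruction, found, text = query.partition(separator)
    if not found:
        return "", query
    return instruction, text


if __name__ == "__main__":
    q = build_query(
        "Determine the status of the pleural effusion.",
        "There is a small increase in the left-sided effusion.",
    )
    print(split_query(q))
```

Keeping the separator in a single constant avoids typos in a string that is easy to mistype.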

Medical Report Retrieval Example

import torch
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "lukeingawesome/llm2vec4cxr",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu").eval()

# Instruction for retrieval
instruction = "Retrieve semantically similar reports"
query_report = "Small left pleural effusion with basal atelectasis."
query = instruction + "!@#$%^&*()" + query_report

# Candidate reports
candidates = [
    "No acute cardiopulmonary abnormality.",
    "Small left pleural effusion is present.",
    "Large right pleural effusion causing compressive atelectasis.",
    "Heart size is normal with no evidence of pleural effusion.",
]

# Compute similarities
scores = model.compute_similarities(query, candidates)

# Get most similar
best_idx = torch.argmax(scores).item()  # plain int, safe for list indexing
print(f"Most similar: {candidates[best_idx]}")
print(f"Score: {scores[best_idx]:.4f}")
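
Retrieval usually needs more than the single best match. The sketch below ranks candidates by score and keeps the top k. It works on plain Python floats, and the scores shown are placeholders for illustration, not model outputs.

```python
def top_k(scores, candidates, k=3):
    """Return the k (candidate, score) pairs with the highest scores."""
    if len(scores) != len(candidates):
        raise ValueError("scores and candidates must have the same length")
    ranked = sorted(zip(candidates, scores),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Placeholder values only; real scores would come from the model.
    candidates = ["report A", "report B", "report C", "report D"]
    scores = [0.41, 0.93, 0.77, 0.58]
    for text, score in top_k(scores, candidates, k=2):
        print(f"{score:.2f}  {text}")
```

With the real model, the score tensor could be converted first, for example with `scores.float().tolist()`.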

API Reference

The model provides three main methods:

`encode_text(texts, max_length=512)`

Simple text encoding for one or more texts.

Parameters:

texts: List of strings or single string
max_length: Maximum sequence length (default: 512)

Returns: Tensor of shape (batch_size, 2048)
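
For a large corpus it may be preferable to encode reports in fixed-size batches rather than in one call. The helper below is a hypothetical sketch. `encode_fn` stands in for something like `model.encode_text`, and the batch size of 16 is an arbitrary assumption.

```python
def encode_in_batches(encode_fn, texts, batch_size=16):
    """Call encode_fn on successive slices of texts and collect the results.

    encode_fn must take a list of strings and return one result per string.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        outputs.extend(encode_fn(batch))
    return outputs


if __name__ == "__main__":
    # Stand-in encoder so the sketch runs without downloading the model:
    # it "embeds" each text as its character count.
    def fake_encode(batch):
        return [len(t) for t in batch]

    print(encode_in_batches(fake_encode, ["a", "bb", "ccc"], batch_size=2))
```

With the real model, each collected item would be one row of the returned tensor, and the rows could be recombined with `torch.stack`.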

Instruction-based encoding

Encodes texts in which an instruction and the report content are joined by a separator.

Parameters:

texts: List of strings with optional separator
separator: String separator (default: '!@#$%^&*()')
max_length: Maximum sequence length (default: 512)

Returns: Tensor of shape (batch_size, 2048)

`compute_similarities(query_text, candidate_texts, separator='!@#$%^&*()', max_length=512)`

Computes the cosine similarity between a query and each candidate text.

Parameters:

query_text: Single query string
candidate_texts: List of candidate strings
separator: String separator (default: '!@#$%^&*()')
max_length: Maximum sequence length (default: 512)

Returns: Tensor of shape (num_candidates,) with cosine similarity scores
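
As a reminder of what a cosine similarity score means, the sketch below computes it for two small vectors in plain Python. It illustrates the standard formula only and is not the model's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Scores lie between -1 and 1 and depend only on direction, not on vector length, so embeddings of different magnitude remain comparable.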

Training Details

Training Data

Fully fine-tuned on chest X-ray reports and medical text data
Training focused on understanding pleural effusion status and other chest X-ray findings

Training Configuration

Pooling Mode: latent_attention (modified from the base model; 512 latents, 8 attention heads)
Enable Bidirectional: True
Max Length: 512 tokens
Torch Dtype: bfloat16
Full Fine-tuning: All model weights were updated during training
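
For readers unfamiliar with the pooling mode named above, here is a simplified sketch of latent attention pooling as described for NV-Embed-style encoders. Token hidden states act as queries over a small learned latent array (used as keys and values), and the attended outputs are mean-pooled. This is an assumption about the general technique, not the model's actual code. It uses a single head with no learned projections and no MLP, whereas the configuration above lists 512 latents and 8 attention heads.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def latent_attention_pool(tokens, latents):
    """Pool token vectors into one vector via attention over a latent array.

    tokens:  list of n token vectors (each of length d), used as queries.
    latents: list of r latent vectors (each of length d), used as both keys
             and values; in a trained model these would be learned.
    """
    d = len(latents[0])
    scale = 1.0 / math.sqrt(d)
    attended = []
    for q in tokens:
        # Scaled dot-product scores of this token against every latent.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in latents]
        weights = softmax(scores)
        # Weighted sum of the latent (value) vectors.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, latents)) for j in range(d)]
        )
    # Mean-pool the attended token vectors into a single embedding.
    n = len(attended)
    return [sum(vec[j] for vec in attended) / n for j in range(d)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]   # toy hidden states
    latents = [[1.0, 0.0], [0.0, 1.0]]  # toy latent array
    print(latent_attention_pool(tokens, latents))
```

In this simplified form every attended vector is a weighted average of the latents, so the pooled embedding lies in their convex hull.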

Technical Specifications

Model Type: Bidirectional Language Model (LLM2Vec)
Architecture: LlamaBiModel (modified Llama 3.2) + Latent Attention Pooling
Parameters: ~1B parameters
Hidden Size: 2048
Input Length: Up to 512 tokens
Output Dimension: 2048
Precision: bfloat16
Dependencies: Only transformers and torch
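
The figures above allow a rough storage estimate. The sketch below works through the arithmetic for bfloat16 (2 bytes per value). The parameter count is the approximate "~1B" stated above, so the weight figure is an estimate and excludes activations and runtime overhead.

```python
BYTES_PER_BF16 = 2              # bfloat16 stores each value in 16 bits
EMBEDDING_DIM = 2048            # output dimension listed above
APPROX_PARAMS = 1_000_000_000   # "~1B parameters"; exact count not given


def embedding_bytes(num_reports, dim=EMBEDDING_DIM,
                    bytes_per_value=BYTES_PER_BF16):
    """Bytes needed to store one embedding per report."""
    return num_reports * dim * bytes_per_value


def weight_bytes(num_params=APPROX_PARAMS, bytes_per_value=BYTES_PER_BF16):
    """Bytes needed to hold the weights alone."""
    return num_params * bytes_per_value


if __name__ == "__main__":
    print(embedding_bytes(1))               # bytes for one report
    print(embedding_bytes(100_000) / 1e6)   # MB for 100k reports
    print(weight_bytes() / 1e9)             # GB for the weights
```

Under these assumptions one embedding takes 4,096 bytes, 100,000 reports take about 410 MB, and the weights take about 2 GB.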

Intended Use

Primary Use Cases

Medical Text Embeddings: Generate embeddings for chest X-ray reports
Clinical Text Similarity: Compare medical texts for semantic similarity
Medical Information Retrieval: Find relevant medical reports or findings
Clinical NLP Research: Foundation model for medical text analysis

Limitations

Specialized for chest X-ray reports - may not generalize to other medical domains
Requires careful preprocessing for optimal performance
Should be used as part of a larger clinical decision support system, not for standalone diagnosis

Evaluation

The model has been evaluated on chest X-ray report analysis tasks, particularly for:

Text retrieval and encoding
Medical text similarity comparison
Clinical finding extraction

Sample Performance

The model demonstrates consistent improvements over the base LLM2CLIP architecture on medical text understanding benchmarks.
LLM2Vec4CXR shows stronger performance in:

Handling medical abbreviations and radiological terminology
Capturing fine-grained semantic differences in chest X-ray reports
Understanding clinical context and temporal changes

Related Resources

📄 Paper: Exploring the Capabilities of LLM Encoders for Image–Text Retrieval in Chest X-rays

🔗 Related Projects:

LLM2CLIP4CXR: A CLIP-based model that leverages the LLM2Vec encoder to align visual and textual representations of chest X-rays

Citation

If you use this model in your research, please cite:

@article{ko2025exploring,
  title={Exploring the Capabilities of LLM Encoders for Image--Text Retrieval in Chest X-rays},
  author={Ko, Hanbin and Cho, Gihun and Baek, Inhyeok and Kim, Donguk and Koo, Joonbeom and Kim, Changi and Lee, Dongheon and Park, Chang Min},
  journal={arXiv preprint arXiv:2509.15234},
  year={2025}
}

Acknowledgments

This model is built upon:

LLM2Vec - Framework for converting decoder-only LLMs into text encoders
LLM2CLIP - Microsoft's implementation for connecting LLMs with CLIP models

License

This model is licensed under the MIT License.
