---
language:
- en
tags:
- dllm
- diffusion-language-model
- text-generation
- diffusion
- language-model
license: apache-2.0
---

# HDLM-Gamma: Hybrid Diffusion Language Model

[![Paper](https://img.shields.io/badge/Paper-arXiv-red)](https://arxiv.org/abs/2504.06416)
[![Code](https://img.shields.io/badge/Code-GitHub-blue)](https://github.com/ServiceNow/hdlm)

This is the model card for **hdlm-group/hdlm-base-gamma-0.01**.

## Model Description

HDLM-Gamma is a hybrid diffusion language model that unifies autoregressive and diffusion-based sequence generation through gamma-hybrid noising. This model interpolates transition operators between absorbing and uniform processes, making it conceptually closer to SEDD (Lou et al. 2024) while maintaining the benefits of both paradigms.

The gamma parameter (γ) controls the blend between absorbing and uniform transition matrices: Q_gamma = (1-γ) * Q_absorb + γ * Q_uniform, where smaller values emphasize the absorbing process and larger values incorporate more uniform transitions.
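
As an illustration, here is a minimal PyTorch sketch of this blend. The vocabulary size and matrix construction are toy placeholders for intuition, not the model's actual continuous-time implementation:

```python
import torch

V = 8            # toy vocabulary size; the last index stands in for the absorbing (mask) token
gamma = 0.01     # blend parameter used by this checkpoint

# Absorbing transitions: every token jumps to the absorbing state.
Q_absorb = torch.zeros(V, V)
Q_absorb[:, -1] = 1.0

# Uniform transitions: every token jumps to any token with equal probability.
Q_uniform = torch.full((V, V), 1.0 / V)

# Gamma-hybrid blend from the formula above.
Q_gamma = (1 - gamma) * Q_absorb + gamma * Q_uniform

# Each row still sums to 1, so the blend is itself a valid transition matrix.
assert torch.allclose(Q_gamma.sum(dim=1), torch.ones(V))
```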

## Model Architecture

- **Base Model**: Transformer architecture with staggered score conditioning
- **Vocabulary Size**: 50,258 tokens (GPT-2 vocabulary + absorbing token; see the check after this list)
- **Context Length**: Variable (supports up to 2048 tokens)
- **Training**: Continuous-time diffusion with gamma-hybrid graph structure
- **Inference**: Analytic predictor with staggered score computation
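
The vocabulary size can be sanity-checked against the GPT-2 tokenizer used in the Quick Start below; the extra index is reserved for the absorbing (mask) token:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

gpt2_vocab = len(tokenizer)    # 50,257 entries in the GPT-2 vocabulary
model_vocab = gpt2_vocab + 1   # plus one absorbing (mask) token -> 50,258
print(gpt2_vocab, model_vocab)
```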

## Usage

### Quick Start

```python
from hdlm.hf_utils import smart_model_loader
from hdlm.gamma_hybrid.sampling import get_sa_sampling_fn
from transformers import GPT2TokenizerFast
import torch

# Load model using smart loader (automatically detects model type)
model, cfg, device, accelerator, metaschedule = smart_model_loader(
    model_path="hdlm-group/hdlm-base-gamma-0.01",
    model_type="auto",  # automatically detects gamma_hybrid
    device="cuda"
)

# Load tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

# Generate text
prompt = "The future of artificial intelligence"
prompt_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

# Configure sampling function (automatically set up from config)
sampling_fn = get_sa_sampling_fn(
    config=cfg,
    graph=None,  # Will be created from config
    noise=None,  # Will be created from config
    meta_schedule=metaschedule,
    batch_dims=(1,),
    eps=1e-4,
    device=device
)

# Generate samples
generated = sampling_fn(
    model=model,
    prompt=prompt_ids,
    context_length=1024
)

# Decode generated text  
generated_text = tokenizer.decode(generated[0], skip_special_tokens=True)
print(generated_text)
```

### Evaluation

```bash
# Text generation evaluation
python hdlm/eval_generation.py \
    --checkpoint_path hdlm-group/hdlm-base-gamma-0.01 \
    --sampling_method SAR \
    --save_samples

# Perplexity evaluation
python hdlm/eval_modeling.py \
    --checkpoint_path hdlm-group/hdlm-base-gamma-0.01 \
    --work_dir "./logs/eval_modeling_gamma" \
    --dataset ptb
```

## Training Details

- **Dataset**: OpenWebText
- **Batch Size**: 256
- **Learning Rate**: 3e-4 with lambda scheduling
- **Gamma (γ)**: 0.01 (controls hybrid transition blend)
- **Graph Type**: QGamma with expanded sigma conditioning
- **Noise Schedule**: Log-linear (σ_min=1e-4, σ_max=10.0); see the sketch after this list
- **Training Steps**: 1M iterations
- **Warmup**: 50K steps
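
For intuition, here is a minimal sketch of such a schedule, assuming "log-linear" means that log σ(t) is interpolated linearly between log σ_min and log σ_max over t ∈ [0, 1]; the actual schedule is defined in the HDLM codebase and may differ in detail:

```python
import torch

sigma_min, sigma_max = 1e-4, 10.0

def sigma(t: torch.Tensor) -> torch.Tensor:
    """Noise level at diffusion time t in [0, 1]: log sigma moves linearly
    from log(sigma_min) at t=0 to log(sigma_max) at t=1."""
    return sigma_min ** (1 - t) * sigma_max ** t

t = torch.linspace(0.0, 1.0, 5)
print(sigma(t))  # grows geometrically from 1e-4 to 10.0
```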

## Key Components

### Graph Structure
The QGamma graph combines absorbing and uniform transition matrices:
- **Absorbing component**: Transitions to absorbing state (mask token)
- **Uniform component**: Uniform transitions between all tokens
- **Hybrid blend**: Controlled by gamma parameter

### Staggered Score
The model uses a staggered score computation that applies different transformations to the absorbing and uniform branches before combining them, enabling more flexible generation patterns.

### Sampling Strategy
- **Predictor**: Analytic predictor with exact transition computation
- **Strategy**: Direct sampling with configurable strategy parameter
- **Noise Removal**: Optional final denoising step

## Model Variants

Available gamma values and their characteristics:

- **γ = 0.01**: Minimal uniform transitions, closest to pure absorbing process
- **γ = 0.1**: Moderate hybrid behavior with increased uniform mixing
- **γ = 0.5**: Balanced absorbing-uniform transition blend
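
Each variant is loaded the same way as in the Quick Start; only the checkpoint path changes. The repository id below for the γ = 0.1 variant is an assumed naming pattern inferred from this checkpoint's id:

```python
from hdlm.hf_utils import smart_model_loader

# Hypothetical repository id for the γ = 0.1 variant, following the
# naming pattern of hdlm-group/hdlm-base-gamma-0.01 documented here.
model, cfg, device, accelerator, metaschedule = smart_model_loader(
    model_path="hdlm-group/hdlm-base-gamma-0.1",
    model_type="auto",
    device="cuda",
)
```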

## Citation

```bibtex
@article{fathi2025unifying,
  title={Unifying autoregressive and diffusion-based sequence generation},
  author={Fathi, Nima and Scholak, Torsten and No{\"e}l, Pierre-Andr{\'e}},
  journal={arXiv preprint arXiv:2504.06416},
  year={2025}
}
```

## License

This model is released under the Apache 2.0 license, matching the original HDLM codebase. Please refer to the [GitHub repository](https://github.com/ServiceNow/hdlm) for details.