---
license: mit
language:
- en
tags:
- attention
- temporal-reasoning
- time-series
- inductive-bias
- plug-and-play
---
# TemporalSelfAttention - A Time-Biased Attention Module
> Give Transformers a sense of time - not by scaling, but by structure.
---
## Why?
Standard attention treats all tokens as temporally interchangeable: the attention score carries no notion of how far apart two tokens are in time.
This works for syntax, but breaks down for:
- Temporal event ordering
- Causal reasoning
- Timeline consistency
- Long-range narrative coherence
💡 Insight: Standard Transformers *simulate* time via token position. We inject it *structurally* with a tiny inductive bias.
---
## Core Equation
The time-aware attention score is computed as:
$$
\text{score}_{ij} = \frac{Q_i \cdot K_j^\top}{\sqrt{d_k}} + \gamma \cdot f(t_j - t_i)
$$
### Notation
| Symbol | Description |
|-----------------|-------------|
| \\( \text{score}_{ij} \\) | Attention score between query at position \\( i \\) and key at position \\( j \\) |
| \\( Q_i \\) | Query vector for position \\( i \\) |
| \\( K_j \\) | Key vector for position \\( j \\) |
| \\( d_k \\) | Dimension of key vectors |
| \\( \gamma \\) | Learnable time bias strength |
| \\( f(\cdot) \\) | Time difference function |
| \\( t_j - t_i \\) | Relative time difference |
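To make the equation concrete, here is a minimal PyTorch sketch of the biased score computation. The helper name `time_biased_scores` is illustrative only and not part of the package; the module's actual implementation may differ.

```python
import torch

def time_biased_scores(q, k, timestamps, f, gamma=1.0):
    """Scaled dot-product scores plus an additive time bias (illustrative sketch).

    q, k:        (B, T, d_k) query / key tensors
    timestamps:  (B, T) per-token times
    f:           time-difference function applied elementwise to t_j - t_i
    gamma:       time-bias strength (a learnable scalar in the real module)
    """
    d_k = q.size(-1)
    # Content term: Q_i . K_j / sqrt(d_k)
    content = q @ k.transpose(-2, -1) / d_k ** 0.5          # (B, T, T)
    # Relative time difference t_j - t_i for every query i / key j pair
    dt = timestamps.unsqueeze(1) - timestamps.unsqueeze(2)  # (B, T, T)
    return content + gamma * f(dt)
```

Attention weights then follow from a softmax over the key dimension, exactly as in standard attention.
---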
## How To Use
```python
import torch
from temporal_attention import TemporalSelfAttention

model = TemporalSelfAttention(
    embed_dim=64,
    num_heads=1,
    bias_type="linear",  # or 'gaussian'
    gamma=1.0,
    causal=False,
)

# x: (B, T, D), timestamps: (B, T); dummy inputs below match these shapes
x = torch.randn(2, 16, 64)
timestamps = torch.arange(16, dtype=torch.float32).unsqueeze(0).expand(2, -1)

output, weights = model(x, timestamps)
```
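The constructor's `bias_type` selects the time-difference function \\( f(\cdot) \\). This card does not spell out the exact forms, so the sketch below is only a plausible reading; the formulas and the `sigma` width parameter are assumptions, not the package's documented behaviour.

```python
import torch

def linear_bias(dt: torch.Tensor) -> torch.Tensor:
    # Assumed form for bias_type="linear": penalty grows with the absolute time gap
    return -dt.abs()

def gaussian_bias(dt: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Assumed form for bias_type="gaussian": smooth locality penalty,
    # with sigma as a hypothetical width parameter
    return -(dt ** 2) / (2 * sigma ** 2)
```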