Update README with KL3M tokenizer paper citation - README.md
README.md
CHANGED
@@ -10,6 +10,7 @@ tags:
- financial
- enterprise
- slm
date: '2024-02-20T00:00:00.000Z'
pipeline_tag: text-generation
widget:

@@ -18,9 +19,9 @@ widget:
- do_sample: True
---

-# kl3m-003-1.7b

-kl3m-1.7b is a small language model (SLM)
developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
kl3m-003-1.7b was part of the first LLM family to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
@@ -29,40 +30,34 @@ with a focus on low toxicity and high efficiency.
Given its small size and lack of training data for instruction alignment, kl3m-003-1.7b is best suited for use either in
SLM fine-tuning or as part of training larger models without using unethical data or models.

-The model was originally trained between January-February 2024 on a 8xA100-80G node in DDP. A similar model is
-being provided with complete source and data replication as part of the `kl3m-004` family to be released in Q4 2024.
-
-## Source
-
-[https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)
-
-
-## Training Data
-While the original training data collection and training infrastructure relies on software that was not donated by
-273 Ventures, ALEA Institute is open-sourcing an improved dataset, including both replication and an API.
-
-[https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)
-
-Data is available upon request at this time via S3 under a Requester Pays model. We are actively working on a
-zero-cost distribution model as soon as we can obtain additional support.
-
-This model, the original `kl3m-003-1.7b` model, was trained on a US-only subset of the Kelvin Legal DataPack that
-we believe is 100% public domain material. However, so as to enforce maximum transparency to all
-downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.
-
## Model Details

-### Summary
- **Architecture**: GPT-NeoX (i.e., ~GPT-3 architecture)
-- **
- **Context Window**: 8,192 tokens (true size, no sliding window)
- **Language(s)**: Primarily English
-- **
- **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
- **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- **Hardware Requirements**: Runs real-time in bf16 on consumer NV/AMD GPUs

-##

### Perplexity Scores
| Dataset | Score |
@@ -81,15 +76,9 @@ larger models as of its training data.
- **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
- **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

-##
-
-- Basic regulatory question answering
-- Contract provision drafting
-- Structured JSON information extraction
-- Foundation for downstream optimization
-- Base model for domain-specific fine-tuning

-

```python
import json
@@ -98,7 +87,7 @@ from transformers import pipeline
# Load the model and tokenizer
p = pipeline('text-generation', 'alea-institute/kl3m-003-1.7b', device='cuda')

-# Example usage on
text = "Under this"
print(
json.dumps(
@@ -115,12 +104,12 @@ print(
[
"Under this section, any person who is a party to the proceeding may be required to file ",
"Under this subsection, the term **eligible entity** means a State, a political subdivision of ",
-"Under this section, the Secretary shall
]
-
```

-
```python
text = "Governing Law. "
print(
@@ -142,7 +131,21 @@ print(
]
```

-

The model implements several techniques during training:

@@ -151,6 +154,77 @@ The model implements several techniques during training:
- Randomized padding
- Traditional fixed-attention mechanisms

## License

This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.
@@ -169,14 +243,4 @@ The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai).

Special thanks to 273 Ventures for developing and donating this model to the open-source community through the Alea Institute.

-
-## Citation
-
-Tokenizer, dataset, and model publications are pending.
-
-## Contact
-
-For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]) or
-create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-model-research).
-
-
- financial
- enterprise
- slm
+- gpt-neox
date: '2024-02-20T00:00:00.000Z'
pipeline_tag: text-generation
widget:

- do_sample: True
---

+# kl3m-003-1.7b

+kl3m-003-1.7b is a small language model (SLM) trained on clean, legally-permissible data. Originally
developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
kl3m-003-1.7b was part of the first LLM family to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,

Given its small size and lack of training data for instruction alignment, kl3m-003-1.7b is best suited for use either in
SLM fine-tuning or as part of training larger models without using unethical data or models.

## Model Details

- **Architecture**: GPT-NeoX (i.e., ~GPT-3 architecture)
+- **Size**: 1.7 billion parameters
+- **Hidden Size**: 2048
+- **Layers**: 32
+- **Attention Heads**: 32
+- **Intermediate Size**: 8192
+- **Max Position Embeddings**: 8192
- **Context Window**: 8,192 tokens (true size, no sliding window)
+- **Tokenizer**: [kl3m-001-32k](https://huggingface.co/alea-institute/kl3m-001-32k) BPE tokenizer (32,768 vocabulary size with unorthodox whitespace handling)
- **Language(s)**: Primarily English
+- **Training Objective**: Next token prediction
- **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
- **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- **Hardware Requirements**: Runs real-time in bf16 on consumer NV/AMD GPUs
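For reference, a minimal sketch of checking these figures against the published checkpoint, assuming the repository loads with the standard `transformers` `AutoConfig`/`AutoTokenizer` classes and GPT-NeoX attribute names:

```python
from transformers import AutoConfig, AutoTokenizer

# Illustrative only: read the config and tokenizer and compare against the list above.
config = AutoConfig.from_pretrained("alea-institute/kl3m-003-1.7b")
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-003-1.7b")

print(config.hidden_size)              # expected: 2048
print(config.num_hidden_layers)        # expected: 32
print(config.num_attention_heads)      # expected: 32
print(config.intermediate_size)        # expected: 8192
print(config.max_position_embeddings)  # expected: 8192
print(tokenizer.vocab_size)            # expected: 32768
```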

+## Use Cases
+
+kl3m-003-1.7b is particularly effective for:
+
+- Basic regulatory question answering
+- Contract provision drafting
+- Structured JSON information extraction
+- Foundation for downstream optimization
+- Base model for domain-specific fine-tuning
+
+## Performance

### Perplexity Scores
| Dataset | Score |

- **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
- **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

+## Usage

+Basic usage for text generation:

```python
import json

# Load the model and tokenizer
p = pipeline('text-generation', 'alea-institute/kl3m-003-1.7b', device='cuda')

+# Example usage on GPU
text = "Under this"
print(
json.dumps(

[
"Under this section, any person who is a party to the proceeding may be required to file ",
"Under this subsection, the term **eligible entity** means a State, a political subdivision of ",
+"Under this section, the Secretary shall— (1)\nmake a grant to the National Academy of Sc"
]
```

+### Contract Example
+
```python
text = "Governing Law. "
print(

]
```

+### Generation Parameters
+
+The model supports various parameters to control the generation process:
+
+- `temperature`: Controls randomness (lower = more deterministic)
+- `top_p`: Nucleus sampling parameter (lower = more focused)
+- `top_k`: Limits vocabulary selection to top k tokens
+- `max_new_tokens`: Maximum number of tokens to generate
+- `do_sample`: Whether to use sampling vs. greedy decoding
+- `num_return_sequences`: Number of different sequences to generate
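For illustration, a minimal sketch of passing these parameters through the `pipeline` object `p` created in the usage example above; the specific values are arbitrary examples rather than recommended settings:

```python
# Sampling-based generation with explicit decoding parameters (illustrative values only).
results = p(
    "Under this section, the Secretary shall",
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_new_tokens=32,
    num_return_sequences=3,
)
for r in results:
    print(r["generated_text"])
```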
+
+## Training
+
+The model was originally trained between January-February 2024 on an 8xA100-80G node in DDP. A similar model is
+being provided with complete source and data replication as part of the `kl3m-004` family to be released in Q4 2024.

The model implements several techniques during training:

- Randomized padding
- Traditional fixed-attention mechanisms

+### Training Data
+
+While the original training data collection and training infrastructure rely on software that was not donated by
+273 Ventures, ALEA Institute is open-sourcing an improved dataset, including both replication and an API.
+
+[https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)
+
+Data is available upon request at this time via S3 under a Requester Pays model. We are actively working on a
+zero-cost distribution model as soon as we can obtain additional support.
+
+This model, the original `kl3m-003-1.7b` model, was trained on a US-only subset of the Kelvin Legal DataPack that
+we believe is 100% public domain material. However, so as to enforce maximum transparency to all
+downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.
+
+## Intended Usage
+
+This model is intended for use in:
+
+- Legal and regulatory document processing systems
+- Contract drafting assistance
+- Financial and enterprise document workflows
+- Educational contexts for learning about domain-specific language models
+- Research on efficient language models for domain-specific applications
+
+## Special Tokens
+
+kl3m-003-1.7b uses standard special tokens from the GPT-NeoX architecture.
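One way to see exactly which special tokens are configured, assuming the tokenizer loads through `AutoTokenizer`, is a short sketch like:

```python
from transformers import AutoTokenizer

# Print the special tokens (e.g., end-of-sequence) that the tokenizer actually defines.
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-003-1.7b")
print(tokenizer.special_tokens_map)
print(tokenizer.eos_token, tokenizer.eos_token_id)
```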
+
+## Limitations
+
+- As a small language model (1.7B parameters), it has limited general knowledge
+- Not instruction-tuned or aligned with human preferences
+- May generate plausible-sounding but incorrect legal or regulatory text
+- Not a substitute for professional legal advice or domain expertise
+- Performance is optimized for legal and financial domains; general performance may be lower
+
+## Ethical Considerations
+
+- This model should not be used to generate legal advice without human expert review
+- The model may reflect biases present in the training data despite efforts to use clean data
+- Generated text should be reviewed by qualified professionals before use in formal legal contexts
+- While trained on ethically sourced data, users should verify outputs for accuracy and appropriateness
+
+## Source
+
+[https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)
+
+## References
+
+- [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
+- Additional tokenizer, dataset, and model publications are pending.
+
+## Citation
+
+```bibtex
+@misc{kl3m-003-1.7b,
+  author = {ALEA Institute},
+  title = {kl3m-003-1.7b: A Small Language Model for Legal and Regulatory Text},
+  year = {2024},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/alea-institute/kl3m-003-1.7b}}
+}
+
+@article{bommarito2025kl3m,
+  title={KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications},
+  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
+  journal={arXiv preprint arXiv:2503.17247},
+  year={2025}
+}
+```
+
## License

This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.

Special thanks to 273 Ventures for developing and donating this model to the open-source community through the Alea Institute.

+