alea-institute committed
Commit 1bf22ea · verified · 1 Parent(s): 38203d5

Update README with KL3M tokenizer paper citation - README.md

Files changed (1)
  1. README.md +115 -51
README.md CHANGED
@@ -10,6 +10,7 @@ tags:
   - financial
   - enterprise
   - slm
   date: '2024-02-20T00:00:00.000Z'
   pipeline_tag: text-generation
   widget:
@@ -18,9 +19,9 @@ widget:
   - do_sample: True
   ---

- # kl3m-003-1.7b Model

- kl3m-1.7b is a small language model (SLM) model trained on clean, legally-permissible data. Originally
  developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
  kl3m-003-1.7b was part of the first LLM family to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
  for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
@@ -29,40 +30,34 @@ with a focus on low toxicity and high efficiency.
  Given its small size and lack of training data for instruction alignment, kl3m-003-1.7b is best suited for use either in
  SLM fine-tuning or as part of training larger models without using unethical data or models.

- The model was originally trained between January-February 2024 on a 8xA100-80G node in DDP. A similar model is
- being provided with complete source and data replication as part of the `kl3m-004` family to be released in Q4 2024.
-
- ## Source
-
- [https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)
-
-
- ## Training Data
- While the original training data collection and training infrastructure relies on software that was not donated by
- 273 Ventures, ALEA Institute is open-sourcing an improved dataset, including both replication and an API.
-
- [https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)
-
- Data is available upon request at this time via S3 under a Requester Pays model. We are actively working on a
- zero-cost distribution model as soon as we can obtain additional support.
-
- This model, the original `kl3m-003-1.7b` model, was trained on a US-only subset of the Kelvin Legal DataPack that
- we believe is 100% public domain material. However, so as to enforce maximum transparency to all
- downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.
-
  ## Model Details

- ### Summary
  - **Architecture**: GPT-NeoX (i.e., ~GPT-3 architecture)
- - **Parameters**: 1.7 billion
  - **Context Window**: 8,192 tokens (true size, no sliding window)
  - **Language(s)**: Primarily English
- - **Tokenizer**: kl3m-001-32k BPE tokenizer (32,768 vocabulary size with unorthodox whitespace handling)
  - **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
  - **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
  - **Hardware Requirements**: Runs real-time in bf16 on consumer NV/AMD GPUs

- ## Performance Metrics

  ### Perplexity Scores
  | Dataset | Score |
@@ -81,15 +76,9 @@ larger models as of its training data.
  - **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
  - **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

- ## Use Cases
-
- - Basic regulatory question answering
- - Contract provision drafting
- - Structured JSON information extraction
- - Foundation for downstream optimization
- - Base model for domain-specific fine-tuning

- ## Getting Started

  ```python
  import json
@@ -98,7 +87,7 @@ from transformers import pipeline
  # Load the model and tokenizer
  p = pipeline('text-generation', 'alea-institute/kl3m-003-1.7b', device='cuda')

- # Example usage on CPU
  text = "Under this"
  print(
      json.dumps(
@@ -115,12 +104,12 @@ print(
  [
    "Under this section, any person who is a party to the proceeding may be required to file ",
    "Under this subsection, the term **eligible entity** means a State, a political subdivision of ",
-   "Under this section, the Secretary shall\u2014 (1)\nmake a grant to the National Academy of Sc"
  ]
-
  ```

- ## Contract Example
  ```python
  text = "Governing Law. "
  print(
@@ -142,7 +131,21 @@ print(
  ]
  ```

- ## Technical Implementation

  The model implements several techniques during training:

@@ -151,6 +154,77 @@ The model implements several techniques during training:
  - Randomized padding
  - Traditional fixed-attention mechanisms

  ## License

  This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.
@@ -169,14 +243,4 @@ The KL3M model family is now maintained by the [ALEA Institute](https://aleainst

  Special thanks to 273 Ventures for developing and donating this model to the open-source community through the Alea Institute.

-
- ## Citation
-
- Tokenizer, dataset, and model publications are pending.
-
- ## Contact
-
- For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]) or
- create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-model-research).
-
- ![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)

   - financial
   - enterprise
   - slm
+  - gpt-neox
   date: '2024-02-20T00:00:00.000Z'
   pipeline_tag: text-generation
   widget:

   - do_sample: True
   ---

+ # kl3m-003-1.7b

+ kl3m-003-1.7b is a small language model (SLM) trained on clean, legally-permissible data. Originally
  developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
  kl3m-003-1.7b was part of the first LLM family to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
  for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,

  Given its small size and lack of training data for instruction alignment, kl3m-003-1.7b is best suited for use either in
  SLM fine-tuning or as part of training larger models without using unethical data or models.

  ## Model Details

  - **Architecture**: GPT-NeoX (i.e., ~GPT-3 architecture)
+ - **Size**: 1.7 billion parameters
+ - **Hidden Size**: 2048
+ - **Layers**: 32
+ - **Attention Heads**: 32
+ - **Intermediate Size**: 8192
+ - **Max Position Embeddings**: 8192
  - **Context Window**: 8,192 tokens (true size, no sliding window)
+ - **Tokenizer**: [kl3m-001-32k](https://huggingface.co/alea-institute/kl3m-001-32k) BPE tokenizer (32,768 vocabulary size with unorthodox whitespace handling)
  - **Language(s)**: Primarily English
+ - **Training Objective**: Next token prediction
  - **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
  - **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
  - **Hardware Requirements**: Runs real-time in bf16 on consumer NV/AMD GPUs
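
The listed architecture values can be checked against the published configuration. A minimal sketch (not part of the original card), assuming the standard Hugging Face `GPTNeoXConfig` field names:

```python
# Minimal sketch: inspect the published config and tokenizer with transformers to
# confirm the architecture details listed above. Field names assume the standard
# Hugging Face GPTNeoXConfig; verify against the files in the model repo.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("alea-institute/kl3m-003-1.7b")
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-003-1.7b")

print(config.model_type)                # expected: "gpt_neox"
print(config.hidden_size)               # expected: 2048
print(config.num_hidden_layers)         # expected: 32
print(config.num_attention_heads)       # expected: 32
print(config.intermediate_size)         # expected: 8192
print(config.max_position_embeddings)   # expected: 8192
print(len(tokenizer))                   # expected: ~32,768 (kl3m-001-32k vocabulary)
```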

+ ## Use Cases
+
+ kl3m-003-1.7b is particularly effective for:
+
+ - Basic regulatory question answering
+ - Contract provision drafting
+ - Structured JSON information extraction
+ - Foundation for downstream optimization
+ - Base model for domain-specific fine-tuning
+
+ ## Performance

  ### Perplexity Scores
  | Dataset | Score |
  - **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
  - **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

+ ## Usage

+ Basic usage for text generation:

  ```python
  import json

  # Load the model and tokenizer
  p = pipeline('text-generation', 'alea-institute/kl3m-003-1.7b', device='cuda')

+ # Example usage on GPU
  text = "Under this"
  print(
      json.dumps(
  [
    "Under this section, any person who is a party to the proceeding may be required to file ",
    "Under this subsection, the term **eligible entity** means a State, a political subdivision of ",
+   "Under this section, the Secretary shall (1)\nmake a grant to the National Academy of Sc"
  ]
  ```

+ ### Contract Example
+
  ```python
  text = "Governing Law. "
  print(

  ]
  ```

+ ### Generation Parameters
+
+ The model supports various parameters to control the generation process, as shown in the sketch below:
+
+ - `temperature`: Controls randomness (lower = more deterministic)
+ - `top_p`: Nucleus sampling parameter (lower = more focused)
+ - `top_k`: Limits vocabulary selection to the top k tokens
+ - `max_new_tokens`: Maximum number of tokens to generate
+ - `do_sample`: Whether to use sampling vs. greedy decoding
+ - `num_return_sequences`: Number of different sequences to generate
+
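
An illustrative sketch of passing these options through the same pipeline used above; the specific values are arbitrary examples, not tuned recommendations:

```python
# Illustrative sketch: generation parameters passed to the text-generation pipeline.
# The values below are arbitrary examples rather than tuned recommendations.
from transformers import pipeline

p = pipeline('text-generation', 'alea-institute/kl3m-003-1.7b', device='cuda')

samples = p(
    "Under this section, the Secretary shall",
    do_sample=True,            # sample instead of greedy decoding
    temperature=0.7,           # lower = more deterministic
    top_p=0.9,                 # nucleus sampling
    top_k=50,                  # restrict choices to the 50 most likely tokens
    max_new_tokens=64,         # cap on generated length
    num_return_sequences=3,    # return three candidate continuations
)
for sample in samples:
    print(sample['generated_text'])
```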
+ ## Training
+
+ The model was originally trained between January and February 2024 on an 8xA100-80G node with DDP. A similar model is
+ being provided with complete source and data replication as part of the `kl3m-004` family to be released in Q4 2024.

  The model implements several techniques during training:

  - Randomized padding
  - Traditional fixed-attention mechanisms
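
The card does not spell out how these techniques are implemented. As one plausible, purely illustrative reading of "randomized padding" (an assumption, not the actual KL3M training code), each batch can be padded to a randomly chosen length rather than always to the full 8,192-token context window:

```python
# Purely illustrative sketch of one possible "randomized padding" scheme; this is an
# assumption for explanation only, not the actual KL3M training code. Each batch is
# padded to a random target length between its longest sequence and the context window.
import random

def pad_batch_randomized(batch, pad_token_id, max_len=8192, seed=None):
    rng = random.Random(seed)
    longest = max(len(seq) for seq in batch)
    target = rng.randint(longest, max_len)
    return [seq + [pad_token_id] * (target - len(seq)) for seq in batch]

# Example: two token-id sequences padded to a shared, randomly chosen length.
padded = pad_batch_randomized([[101, 7, 9], [101, 7]], pad_token_id=0, max_len=16, seed=42)
print([len(seq) for seq in padded])  # both sequences share the same sampled length
```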

+ ### Training Data
+
+ While the original training data collection and training infrastructure rely on software that was not donated by
+ 273 Ventures, ALEA Institute is open-sourcing an improved dataset, including both replication and an API.
+
+ [https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)
+
+ Data is currently available upon request via S3 under a Requester Pays model. We are actively working toward a
+ zero-cost distribution model and will adopt one as soon as we can obtain additional support.
+
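For reference, Requester Pays objects must be fetched with the payer declared on each request. A hedged sketch with boto3; the bucket and key below are hypothetical placeholders, since the real S3 locations are provided on request:

```python
# Hedged sketch of a Requester Pays download with boto3. The bucket and key are
# hypothetical placeholders, not the actual distribution paths; the requester's
# AWS account is billed for the transfer.
import boto3

s3 = boto3.client("s3")
response = s3.get_object(
    Bucket="example-kl3m-data-bucket",    # placeholder, not the real bucket
    Key="example/path/to/records.jsonl",  # placeholder, not a real key
    RequestPayer="requester",             # required for Requester Pays buckets
)
data = response["Body"].read()
```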
+
+ This model, the original `kl3m-003-1.7b` model, was trained on a US-only subset of the Kelvin Legal DataPack that
+ we believe is 100% public domain material. However, to provide maximum transparency to all
+ downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.
+
+ ## Intended Usage
+
+ This model is intended for use in:
+
+ - Legal and regulatory document processing systems
+ - Contract drafting assistance
+ - Financial and enterprise document workflows
+ - Educational contexts for learning about domain-specific language models
+ - Research on efficient language models for domain-specific applications
+
+ ## Special Tokens
+
+ kl3m-003-1.7b uses standard special tokens from the GPT-NeoX architecture.
+
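The exact token strings are easiest to read off the tokenizer itself; a minimal sketch (not from the original card):

```python
# Minimal sketch: list the special tokens the kl3m-001-32k tokenizer actually defines,
# rather than assuming particular strings from GPT-NeoX defaults.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-003-1.7b")
print(tokenizer.special_tokens_map)    # e.g., bos/eos/unk/pad entries, if defined
print(tokenizer.all_special_tokens)    # full list of registered special tokens
```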
+ ## Limitations
+
+ - As a small language model (1.7B parameters), it has limited general knowledge
+ - Not instruction-tuned or aligned with human preferences
+ - May generate plausible-sounding but incorrect legal or regulatory text
+ - Not a substitute for professional legal advice or domain expertise
+ - Performance is optimized for legal and financial domains; general performance may be lower
+
+ ## Ethical Considerations
+
+ - This model should not be used to generate legal advice without human expert review
+ - The model may reflect biases present in the training data despite efforts to use clean data
+ - Generated text should be reviewed by qualified professionals before use in formal legal contexts
+ - While trained on ethically sourced data, users should verify outputs for accuracy and appropriateness
+
+ ## Source
+
+ [https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)
+
+ ## References
+
+ - [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
+ - Additional tokenizer, dataset, and model publications are pending.
+
+ ## Citation
+
+ ```bibtex
+ @misc{kl3m-003-1.7b,
+   author = {ALEA Institute},
+   title = {kl3m-003-1.7b: A Small Language Model for Legal and Regulatory Text},
+   year = {2024},
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/alea-institute/kl3m-003-1.7b}}
+ }
+
+ @article{bommarito2025kl3m,
+   title = {KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications},
+   author = {Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
+   journal = {arXiv preprint arXiv:2503.17247},
+   year = {2025}
+ }
+ ```
+
  ## License

  This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.

  Special thanks to 273 Ventures for developing and donating this model to the open-source community through the Alea Institute.

+ ![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)