xuange committed on
Commit cb94d24 · verified · 1 Parent(s): 4fd2dba

Update README.md

Files changed (1):
  1. README.md +92 -40

README.md CHANGED
@@ -1,46 +1,50 @@
- ---
- license: apache-2.0
- datasets:
- - stableai-org/bcco_cls
- - stableai-org/bcco_reg
- language:
- - en
- - zh
- ---
  <div align="center">
- <h1>LimiX</h1>
  </div>

  <div align="center">
- <img src="https://raw.githubusercontent.com/limix-ldm/LimiX/refs/heads/main/doc/LimiX-Logo.png" alt="LimiX logo" width="89%">
  </div>

- # News :boom:
- - 2025-08-29: LimiX V1.0 Released.

  # ➤ Overview
- We posit that progress toward general intelligence will require different complementary classes of foundation models, each anchored to a distinct data modality and set of inductive biases. large language models (LLMs) provide a universal interface for natural and programming languages and have rapidly advanced instruction following, tool use, and explicit reasoning over token sequences. In real-world scenarios involving structured data, LLMs still rely primarily on statistical correlations between word sequences, which limits their ability to accurately capture numerical relationships and causal rules. In contrast, large structured-data models (LDMs) are trained on heterogeneous tabular and relational data to capture conditional and joint dependencies, support diverse tasks and applications, enable robust prediction under distribution shifts, handle missingness, and facilitate counterfactual analysis and feature attribution. Here, we introduce LimiX, the first installment of our LDM series. LimiX aims to push generality further: a single model that handles classification, regression, missing-value imputation, feature selection, sample selection, and causal inference under one training and inference recipe, advancing the shift from bespoke pipelines to unified, foundation-style tabular learning.

  LimiX adopts a transformer architecture optimized for structured data modeling and task generalization. The model first embeds features X and targets Y from the prior knowledge base into token representations. Within the core modules, attention mechanisms are applied across both sample and feature dimensions to identify salient patterns in key samples and features. The resulting high-dimensional representations are then passed to regression and classification heads, enabling the model to support diverse predictive tasks.

- For details, please refer to the technical report at the link: [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf)

- # ➤ Comparative experimental results
  The LimiX model achieved SOTA performance across multiple tasks.

- ## ➩ Classification comparison results
  <div align="center">
- <img src="https://raw.githubusercontent.com/limix-ldm/LimiX/refs/heads/main/doc/Classifier.png" alt="Classification" width="80%">
  </div>

- ## ➩ Regression comparison results
  <div align="center">
- <img src="https://raw.githubusercontent.com/limix-ldm/LimiX/refs/heads/main/doc/Regression.png" alt="Regression" width="80%">
  </div>

- ## ➩ Missing value imputation comparison results
  <div align="center">
- <img src="https://raw.githubusercontent.com/limix-ldm/LimiX/refs/heads/main/doc/MissingValueImputation.png" alt="Missing value imputation" width="80%">
  </div>

  # ➤ Tutorials
@@ -58,9 +62,9 @@ wget -O flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64
  ```
  Install Python dependencies
  ```bash
- pip install python=3.12.7 torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
  pip install flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
- pip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost
  ```

  ### Download source code
@@ -75,6 +79,7 @@ LimiX supports tasks such as classification, regression, and missing value imput
  | Model size | Download link | Tasks supported |
  | --- | --- | --- |
  | LimiX-16M | [LimiX-16M.ckpt](https://huggingface.co/stableai-org/LimiX-16M/tree/main) | ✅ classification ✅ regression ✅ missing value imputation |

  ## ➩ Interface description
@@ -117,22 +122,31 @@ def predict(self, x_train:np.ndarray, y_train:np.ndarray, x_test:np.ndarray) ->
  | y_train | np.ndarray | The target variable of the training set |
  | x_test | np.ndarray | The input features of the test set |

  ## ➩ Ensemble Inference Based on Sample Retrieval

  For a detailed technical introduction to Ensemble Inference Based on Sample Retrieval, please refer to the [technical report](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).

- Considering inference speed, ensemble inference based on sample retrieval currently only supports hardware with specifications higher than the NVIDIA RTX 4090 GPU.

  ### Classification Task

  ```
- torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_config --data_dir path_to_data
  ```

  ### Regression Task

  ```
- torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_config --data_dir path_to_data
  ```

  ### Customizing Data Preprocessing for Inference Tasks
@@ -146,27 +160,45 @@ generate_inference_config()
  #### Single GPU or CPU

  ```
- python inference_classifier.py --save_name your_save_name --inference_config_path path_to_config --data_dir path_to_data
  ```

  #### Multi-GPU Distributed Inference

  ```
- torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_config --data_dir path_to_data --inference_with_DDP
  ```

  ### Regression Task
  #### Single GPU or CPU

  ```
- python inference_regression.py --save_name your_save_name --inference_config_path path_to_config --data_dir path_to_data
  ```

  #### Multi-GPU Distributed Inference

  ```
- torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_config --data_dir path_to_data --inference_with_DDP
  ```

  ## ➩ Classification
  ```python
@@ -177,6 +209,11 @@ from huggingface_hub import hf_hub_download
  import numpy as np
  import os, sys

  ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
  if ROOT_DIR not in sys.path:
      sys.path.insert(0, ROOT_DIR)
@@ -185,9 +222,9 @@ from inference.predictor import LimiXPredictor
  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

- model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir=".")

- clf = LimiXPredictor(device='cuda', model_path='your model path', inference_config='config/cls_default_noretrieval.json')
  prediction = clf.predict(X_train, y_train, X_test)

  print("roc_auc_score:", roc_auc_score(y_test, prediction[:, 1]))
@@ -207,8 +244,14 @@ try:
      from sklearn.metrics import root_mean_squared_error as mean_squared_error
  except:
      from sklearn.metrics import mean_squared_error
-     mean_squared_error = partial(mean_squared_error, squared=True)
  import os, sys

  ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
  if ROOT_DIR not in sys.path:
      sys.path.insert(0, ROOT_DIR)
@@ -223,10 +266,9 @@ y_std = y_train.std()
  y_train_normalized = (y_train - y_mean) / y_std
  y_test_normalized = (y_test - y_mean) / y_std

- data_device = f'cuda:0'
- model_path = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir=".")

- model = LimiXPredictor(device='cuda', model_path=model_path, inference_config='config/reg_default_noretrieval.json')
  y_pred = model.predict(X_train, y_train_normalized, X_test)

  # Compute RMSE and R²
@@ -237,17 +279,27 @@ r2 = r2_score(y_test_normalized, y_pred)
  print(f'RMSE: {rmse}')
  print(f'R2: {r2}')
  ```
- For additional examples, refer to [inference_regression.py](./inference_regression.py)

  ## ➩ Missing value imputation
- For the demo file, see [examples/demo_missing_value_imputation.py](examples/inference_regression.py)

  # ➤ Link
  - LimiX Technical Report: [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf)
  - Balance Comprehensive Challenging Omni-domain Classification Benchmark: [bcco_cls](https://huggingface.co/datasets/stableai-org/bcco_cls)
  - Balance Comprehensive Challenging Omni-domain Regression Benchmark: [bcco_reg](https://huggingface.co/datasets/stableai-org/bcco_reg)

  # ➤ License
  The code in this repository is open-sourced under the [Apache-2.0](LICENSE.txt) license, while the usage of the LimiX model weights is subject to the Model License. The LimiX weights are fully available for academic research and may be used commercially upon obtaining proper authorization.

- # ➤ Reference

  <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/LimiX-Logo.png" alt="LimiX logo" width="89%">
  </div>

+ # 💥 News
+ - 2025-11-10: LimiX-2M is officially released! Compared to LimiX-16M, this smaller variant offers significantly lower GPU memory usage and faster inference speed. The retrieval mechanism has also been enhanced, further improving model performance while reducing both inference time and memory consumption.
+ - 2025-08-29: LimiX V1.0 Released.
+
+ # ⚡ Latest Results Compared with SOTA Models
  <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/BCCO-CLS.png" width="30%" style="display:inline-block;">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/TabArena-CLS.png" width="30%" style="display:inline-block;">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/TabZilla-CLS.png" width="30%" style="display:inline-block;">
+ </div>
+ <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/BCCO-REG.png" width="30%" style="display:inline-block;">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/TabArena-REG.png" width="30%" style="display:inline-block;">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/CTR23-REG.png" width="30%" style="display:inline-block;">
  </div>

  # ➤ Overview
+ <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/LimiX_Summary.png" alt="LimiX summary" width="89%">
+ </div>
+ We introduce LimiX, the first installment of our large structured-data model (LDM) series. LimiX aims to push generality further: a single model that handles classification, regression, missing-value imputation, feature selection, sample selection, and causal inference under one training and inference recipe, advancing the shift from bespoke pipelines to unified, foundation-style tabular learning.

  LimiX adopts a transformer architecture optimized for structured data modeling and task generalization. The model first embeds features X and targets Y from the prior knowledge base into token representations. Within the core modules, attention mechanisms are applied across both sample and feature dimensions to identify salient patterns in key samples and features. The resulting high-dimensional representations are then passed to regression and classification heads, enabling the model to support diverse predictive tasks.
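
As an illustration of the dual-axis attention described above, here is a minimal sketch in plain PyTorch. This is not the LimiX source: the module, names, and dimensions are hypothetical, and the real blocks include embeddings, prediction heads, and other components omitted here.

```python
import torch
import torch.nn as nn

class DualAxisAttentionBlock(nn.Module):
    """Toy block: attend across samples, then across features (illustrative only)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.sample_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.feature_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (n_samples, n_features, dim), one embedding per table cell
        x = tokens.permute(1, 0, 2)            # (features, samples, dim)
        x = x + self.sample_attn(x, x, x)[0]   # samples (rows) exchange information
        x = x.permute(1, 0, 2)                 # (samples, features, dim)
        x = x + self.feature_attn(x, x, x)[0]  # features (columns) exchange information
        return x

# toy usage: 8 samples, 5 features, 32-dim embeddings
block = DualAxisAttentionBlock(dim=32)
print(block(torch.randn(8, 5, 32)).shape)  # torch.Size([8, 5, 32])
```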

+ For details, please refer to the technical report: [LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505) or [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).

+ # ➤ Superior Performance
  The LimiX model achieved SOTA performance across multiple tasks.

+ ## ➩ Classification (Tech Report)
  <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/Classifier.png" alt="Classification" width="80%">
  </div>

+ ## ➩ Regression (Tech Report)
  <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/Regression.png" alt="Regression" width="60%">
  </div>

+ ## ➩ Missing Value Imputation (Tech Report)
  <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/MissingValueImputation.png" alt="Missing value imputation" width="80%">
  </div>

  # ➤ Tutorials
 
  ```
  Install Python dependencies
  ```bash
+ # requires a Python 3.12.7 environment
+ pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
  pip install flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
+ pip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost kditransform hyperopt
  ```
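
Optionally, a quick import check (illustrative, not part of the official instructions; it assumes the pinned stack above installed cleanly):

```python
# verify the pinned stack is importable and CUDA is visible
import torch
import flash_attn  # noqa: F401
print(torch.__version__, torch.cuda.is_available())
```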

  ### Download source code
 
  | Model size | Download link | Tasks supported |
  | --- | --- | --- |
  | LimiX-16M | [LimiX-16M.ckpt](https://huggingface.co/stableai-org/LimiX-16M/tree/main) | ✅ classification ✅ regression ✅ missing value imputation |
+ | LimiX-2M | [LimiX-2M.ckpt](https://huggingface.co/stableai-org/LimiX-2M/tree/main) | ✅ classification ✅ regression ✅ missing value imputation |
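
Either checkpoint can also be fetched programmatically; a short sketch for LimiX-2M (the repo id and filename are taken from the download link above, and the examples later in this README use the same pattern for LimiX-16M):

```python
from huggingface_hub import hf_hub_download

# download the checkpoint into ./cache and get its local path
ckpt_path = hf_hub_download(repo_id="stableai-org/LimiX-2M", filename="LimiX-2M.ckpt", local_dir="./cache")
print(ckpt_path)
```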

  ## ➩ Interface description
 
  | y_train | np.ndarray | The target variable of the training set |
  | x_test | np.ndarray | The input features of the test set |

+ ## ➩ Inference Configuration File Description
+ | Configuration File Name | Description | Difference |
+ | ------- | ---------- | ----- |
+ | cls_default_retrieval.json | Default **classification task** inference configuration file **with retrieval** | Better classification performance |
+ | cls_default_noretrieval.json | Default **classification task** inference configuration file **without retrieval** | Faster speed, lower memory requirements |
+ | reg_default_retrieval.json | Default **regression task** inference configuration file **with retrieval** | Better regression performance |
+ | reg_default_noretrieval.json | Default **regression task** inference configuration file **without retrieval** | Faster speed, lower memory requirements |
+ | reg_default_noretrieval_MVI.json | Default inference configuration file for the **missing value imputation task** | |
+
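Each configuration file is passed to the predictor via `inference_config`. A small sketch of the trade-off (illustrative; `LimiXPredictor` and `model_file` are defined as in the classification example further below):

```python
# retrieval config favors accuracy; no-retrieval favors speed and lower memory
prefer_accuracy = True  # hypothetical switch
config = ('config/cls_default_retrieval.json' if prefer_accuracy
          else 'config/cls_default_noretrieval.json')
clf = LimiXPredictor(device='cuda', model_path=model_file, inference_config=config)
```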
  ## ➩ Ensemble Inference Based on Sample Retrieval

  For a detailed technical introduction to Ensemble Inference Based on Sample Retrieval, please refer to the [technical report](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).

+ Considering inference speed and memory requirements, ensemble inference based on sample retrieval currently requires hardware with specifications at or above the NVIDIA RTX 4090 GPU.

  ### Classification Task

  ```
+ python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
  ```

  ### Regression Task

  ```
+ python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
  ```

  ### Customizing Data Preprocessing for Inference Tasks
 
  #### Single GPU or CPU

  ```
+ python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
  ```

  #### Multi-GPU Distributed Inference

  ```
+ torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
  ```

  ### Regression Task
  #### Single GPU or CPU

  ```
+ python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
  ```

  #### Multi-GPU Distributed Inference

  ```
+ torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
+ ```
+
+ ### Retrieval Optimization Project
+ LimiX also implements an optimized retrieval system. To achieve the best performance, Optuna is used for hyperparameter tuning of the retrieval parameters.
+ #### Installation
+ Ensure the required dependency is installed:
+ ```
+ pip install optuna
+ ```
+ #### Usage
+ To search for optimized retrieval parameters on your own dataset, refer to the code below:
  ```
+ searchInference = RetrievalSearchHyperparameters(
+     dict(device_id=0, model_path=model_path), X_train, y_train, X_test, y_test,
+ )
+ config, result = searchInference.search(n_trials=10, metric="AUC",
+                                         inference_config='config/cls_default_retrieval.json', task_type="cls")
+ ```
+ This will launch an Optuna study to find the best combination of retrieval parameters for your specific dataset and use case.
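
A natural follow-up is to run inference with the tuned settings (a sketch, assuming the returned `config` is accepted wherever `inference_config` is passed; the repository's examples are the authoritative reference):

```python
# hypothetical: reuse the tuned retrieval configuration for prediction
clf = LimiXPredictor(device='cuda', model_path=model_path, inference_config=config)
prediction = clf.predict(X_train, y_train, X_test)
```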

  ## ➩ Classification
  ```python
 
  import numpy as np
  import os, sys

+ # single-process defaults so torch.distributed can initialize without torchrun
+ os.environ["RANK"] = "0"
+ os.environ["WORLD_SIZE"] = "1"
+ os.environ["MASTER_ADDR"] = "127.0.0.1"
+ os.environ["MASTER_PORT"] = "29500"
+
  ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
  if ROOT_DIR not in sys.path:
      sys.path.insert(0, ROOT_DIR)

  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

+ model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")

+ clf = LimiXPredictor(device='cuda', model_path=model_file, inference_config='config/cls_default_retrieval.json')
  prediction = clf.predict(X_train, y_train, X_test)

  print("roc_auc_score:", roc_auc_score(y_test, prediction[:, 1]))
 
      from sklearn.metrics import root_mean_squared_error as mean_squared_error
  except:
      from sklearn.metrics import mean_squared_error
+     mean_squared_error = partial(mean_squared_error, squared=False)
  import os, sys
+
+ # single-process defaults so torch.distributed can initialize without torchrun
+ os.environ["RANK"] = "0"
+ os.environ["WORLD_SIZE"] = "1"
+ os.environ["MASTER_ADDR"] = "127.0.0.1"
+ os.environ["MASTER_PORT"] = "29500"
+
  ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
  if ROOT_DIR not in sys.path:
      sys.path.insert(0, ROOT_DIR)

  y_train_normalized = (y_train - y_mean) / y_std
  y_test_normalized = (y_test - y_mean) / y_std

+ model_path = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")

+ model = LimiXPredictor(device='cuda', model_path=model_path, inference_config='config/reg_default_retrieval.json')
  y_pred = model.predict(X_train, y_train_normalized, X_test)

  # Compute RMSE and R²

  print(f'RMSE: {rmse}')
  print(f'R2: {r2}')
  ```
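
Because the targets were standardized before fitting, predictions can be mapped back to the original scale; a small follow-up sketch using the variables from the example above:

```python
# undo the target standardization and report RMSE on the original scale
y_pred_original = y_pred * y_std + y_mean
print('RMSE (original scale):', mean_squared_error(y_test, y_pred_original))
```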
+ For additional examples, refer to [inference_regression.py](https://github.com/limix-ldm/LimiX/blob/main/inference_regression.py)

  ## ➩ Missing value imputation
+ For the demo file, see [demo_missing_value_imputation.py](https://github.com/limix-ldm/LimiX/blob/main/examples/demo_missing_value_imputation.py)
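
The demo script is the authoritative reference. As a rough, hypothetical sketch of the idea, the column containing missing entries can be treated as the prediction target and filled from the complete columns, using the MVI configuration listed above (`LimiXPredictor` and `model_path` as in the regression example):

```python
import numpy as np

# hypothetical sketch: fill NaNs in column `col` of a feature matrix X
col = 3
mask = np.isnan(X[:, col])
imputer = LimiXPredictor(device='cuda', model_path=model_path,
                         inference_config='config/reg_default_noretrieval_MVI.json')
X[mask, col] = imputer.predict(np.delete(X[~mask], col, axis=1), X[~mask, col],
                               np.delete(X[mask], col, axis=1))
```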
 
  # ➤ Link
+ - LimiX paper: [LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505)
  - LimiX Technical Report: [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf)
+ - Detailed instructions for using LimiX: [the official LimiX documentation](https://www.limix.ai/doc/)
  - Balance Comprehensive Challenging Omni-domain Classification Benchmark: [bcco_cls](https://huggingface.co/datasets/stableai-org/bcco_cls)
  - Balance Comprehensive Challenging Omni-domain Regression Benchmark: [bcco_reg](https://huggingface.co/datasets/stableai-org/bcco_reg)

  # ➤ License
  The code in this repository is open-sourced under the [Apache-2.0](LICENSE.txt) license, while the usage of the LimiX model weights is subject to the Model License. The LimiX weights are fully available for academic research and may be used commercially upon obtaining proper authorization.

+ # ➤ Citation
+ ```
+ @article{LimiX,
+   title={LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence},
+   author={LimiX Team},
+   journal={arXiv preprint arXiv:2509.03505},
+   year={2025}
+ }
+ ```