xuange committed on
Commit cb94d24 · verified · 1 Parent(s): 4fd2dba

Update README.md

Files changed (1):
  1. README.md +92 -40

README.md CHANGED
@@ -1,46 +1,50 @@
- ---
- license: apache-2.0
- datasets:
- - stableai-org/bcco_cls
- - stableai-org/bcco_reg
- language:
- - en
- - zh
- ---
  <div align="center">
- <h1>LimiX</h1>
  </div>

  <div align="center">
- <img src="https://raw.githubusercontent.com/limix-ldm/LimiX/refs/heads/main/doc/LimiX-Logo.png" alt="LimiX logo" width="89%">
  </div>

- # News :boom:
- - 2025-08-29: LimiX V1.0 Released.

  # ➤ Overview
- We posit that progress toward general intelligence will require different complementary classes of foundation models, each anchored to a distinct data modality and set of inductive biases. large language models (LLMs) provide a universal interface for natural and programming languages and have rapidly advanced instruction following, tool use, and explicit reasoning over token sequences. In real-world scenarios involving structured data, LLMs still rely primarily on statistical correlations between word sequences, which limits their ability to accurately capture numerical relationships and causal rules. In contrast, large structured-data models (LDMs) are trained on heterogeneous tabular and relational data to capture conditional and joint dependencies, support diverse tasks and applications, enable robust prediction under distribution shifts, handle missingness, and facilitate counterfactual analysis and feature attribution. Here, we introduce LimiX, the first installment of our LDM series. LimiX aims to push generality further: a single model that handles classification, regression, missing-value imputation, feature selection, sample selection, and causal inference under one training and inference recipe, advancing the shift from bespoke pipelines to unified, foundation-style tabular learning.

  LimiX adopts a transformer architecture optimized for structured data modeling and task generalization. The model first embeds features X and targets Y from the prior knowledge base into token representations. Within the core modules, attention mechanisms are applied across both sample and feature dimensions to identify salient patterns in key samples and features. The resulting high-dimensional representations are then passed to regression and classification heads, enabling the model to support diverse predictive tasks.

- For details, please refer to the technical report at the link: [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf)

- # ➤ Comparative experimental results
  The LimiX model achieved SOTA performance across multiple tasks.

- ## ➩ Classification comparison results
  <div align="center">
- <img src="https://raw.githubusercontent.com/limix-ldm/LimiX/refs/heads/main/doc/Classifier.png" alt="Classification" width="80%">
  </div>

- ## ➩ Regression comparison results
  <div align="center">
- <img src="https://raw.githubusercontent.com/limix-ldm/LimiX/refs/heads/main/doc/Regression.png" alt="Regression" width="80%">
  </div>

- ## ➩ Missing value imputation comparison results
  <div align="center">
- <img src="https://raw.githubusercontent.com/limix-ldm/LimiX/refs/heads/main/doc/MissingValueImputation.png" alt="Missing value imputation" width="80%">
  </div>

  # ➤ Tutorials
@@ -58,9 +62,9 @@ wget -O flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64
  ```
  Install Python dependencies
  ```bash
- pip install python=3.12.7 torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
  pip install flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
- pip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost
  ```

  ### Download source code
@@ -75,6 +79,7 @@ LimiX supports tasks such as classification, regression, and missing value imput
  | Model size | Download link | Tasks supported |
  | --- | --- | --- |
  | LimiX-16M | [LimiX-16M.ckpt](https://huggingface.co/stableai-org/LimiX-16M/tree/main) | ✅ classification ✅ regression ✅ missing value imputation |

  ## ➩ Interface description
@@ -117,22 +122,31 @@ def predict(self, x_train:np.ndarray, y_train:np.ndarray, x_test:np.ndarray) ->
  | y_train | np.ndarray | The target variable of the training set |
  | x_test | np.ndarray | The input features of the test set |

  ## ➩ Ensemble Inference Based on Sample Retrieval

  For a detailed technical introduction to Ensemble Inference Based on Sample Retrieval, please refer to the [technical report](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).

- Considering inference speed, ensemble inference based on sample retrieval currently only supports hardware with specifications higher than the NVIDIA RTX 4090 GPU.

  ### Classification Task

  ```
- torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_config --data_dir path_to_data
  ```

  ### Regression Task

  ```
- torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_config --data_dir path_to_data
  ```

  ### Customizing Data Preprocessing for Inference Tasks
@@ -146,27 +160,45 @@ generate_inference_config()
  #### Single GPU or CPU

  ```
- python inference_classifier.py --save_name your_save_name --inference_config_path path_to_config --data_dir path_to_data
  ```

  #### Multi-GPU Distributed Inference

  ```
- torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_config --data_dir path_to_data --inference_with_DDP
  ```

  ### Regression Task
  #### Single GPU or CPU

  ```
- python inference_regression.py --save_name your_save_name --inference_config_path path_to_config --data_dir path_to_data
  ```

  #### Multi-GPU Distributed Inference

  ```
- torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_config --data_dir path_to_data --inference_with_DDP
  ```

  ## ➩ Classification
  ```python
@@ -177,6 +209,11 @@ from huggingface_hub import hf_hub_download
  import numpy as np
  import os, sys

  ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
  if ROOT_DIR not in sys.path:
      sys.path.insert(0, ROOT_DIR)
@@ -185,9 +222,9 @@ from inference.predictor import LimiXPredictor
  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

- model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir=".")

- clf = LimiXPredictor(device='cuda', model_path='your model path', inference_config='config/cls_default_noretrieval.json')
  prediction = clf.predict(X_train, y_train, X_test)

  print("roc_auc_score:", roc_auc_score(y_test, prediction[:, 1]))
@@ -207,8 +244,14 @@ try:
      from sklearn.metrics import root_mean_squared_error as mean_squared_error
  except:
      from sklearn.metrics import mean_squared_error
-     mean_squared_error = partial(mean_squared_error, squared=True)
  import os, sys

  ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
  if ROOT_DIR not in sys.path:
      sys.path.insert(0, ROOT_DIR)
@@ -223,10 +266,9 @@ y_std = y_train.std()
  y_train_normalized = (y_train - y_mean) / y_std
  y_test_normalized = (y_test - y_mean) / y_std

- data_device = f'cuda:0'
- model_path = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir=".")

- model = LimiXPredictor(device='cuda', model_path=model_path, inference_config='config/reg_default_noretrieval.json')
  y_pred = model.predict(X_train, y_train_normalized, X_test)

  # Compute RMSE and R²
@@ -237,17 +279,27 @@ r2 = r2_score(y_test_normalized, y_pred)
  print(f'RMSE: {rmse}')
  print(f'R2: {r2}')
  ```
- For additional examples, refer to [inference_regression.py](./inference_regression.py)

  ## ➩ Missing value imputation
- For the demo file, see [examples/demo_missing_value_imputation.py](examples/inference_regression.py)

  # ➤ Link
  - LimiX Technical Report: [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf)
  - Balance Comprehensive Challenging Omni-domain Classification Benchmark: [bcco_cls](https://huggingface.co/datasets/stableai-org/bcco_cls)
  - Balance Comprehensive Challenging Omni-domain Regression Benchmark: [bcco_reg](https://huggingface.co/datasets/stableai-org/bcco_reg)

  # ➤ License
  The code in this repository is open-sourced under the [Apache-2.0](LICENSE.txt) license, while the usage of the LimiX model weights is subject to the Model License. The LimiX weights are fully available for academic research and may be used commercially upon obtaining proper authorization.

- # ➤ Reference

  <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/LimiX-Logo.png" alt="LimiX logo" width="89%">
  </div>

+ # 💥 News
+ - 2025-11-10: LimiX-2M is officially released! Compared to LimiX-16M, this smaller variant offers significantly lower GPU memory usage and faster inference speed. The retrieval mechanism has also been enhanced, further improving model performance while reducing both inference time and memory consumption.
+ - 2025-08-29: LimiX V1.0 Released.
+
+ # ⚡ Latest Results Compared with SOTA Models
  <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/BCCO-CLS.png" width="30%" style="display:inline-block;">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/TabArena-CLS.png" width="30%" style="display:inline-block;">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/TabZilla-CLS.png" width="30%" style="display:inline-block;">
+ </div>
+ <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/BCCO-REG.png" width="30%" style="display:inline-block;">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/TabArena-REG.png" width="30%" style="display:inline-block;">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/CTR23-REG.png" width="30%" style="display:inline-block;">
  </div>

  # ➤ Overview
+ <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/LimiX_Summary.png" alt="LimiX summary" width="89%">
+ </div>
+ We introduce LimiX, the first installment of our large structured-data model (LDM) series. LimiX aims to push generality further: a single model that handles classification, regression, missing-value imputation, feature selection, sample selection, and causal inference under one training and inference recipe, advancing the shift from bespoke pipelines to unified, foundation-style tabular learning.

  LimiX adopts a transformer architecture optimized for structured data modeling and task generalization. The model first embeds features X and targets Y from the prior knowledge base into token representations. Within the core modules, attention mechanisms are applied across both sample and feature dimensions to identify salient patterns in key samples and features. The resulting high-dimensional representations are then passed to regression and classification heads, enabling the model to support diverse predictive tasks.
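
As an illustration of the dual-axis attention described above, here is a minimal sketch in plain PyTorch. This is not the LimiX source: the module, names, and dimensions are hypothetical, and the real blocks include embeddings, prediction heads, and other components omitted here.

```python
import torch
import torch.nn as nn

class DualAxisAttentionBlock(nn.Module):
    """Toy block: attend across samples, then across features (illustrative only)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.sample_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.feature_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (n_samples, n_features, dim), one embedding per table cell
        x = tokens.permute(1, 0, 2)            # (features, samples, dim)
        x = x + self.sample_attn(x, x, x)[0]   # samples (rows) exchange information
        x = x.permute(1, 0, 2)                 # (samples, features, dim)
        x = x + self.feature_attn(x, x, x)[0]  # features (columns) exchange information
        return x

# toy usage: 8 samples, 5 features, 32-dim embeddings
block = DualAxisAttentionBlock(dim=32)
print(block(torch.randn(8, 5, 32)).shape)  # torch.Size([8, 5, 32])
```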

+ For details, please refer to the technical report: [LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505) or [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).

+ # ➤ Superior Performance
  The LimiX model achieved SOTA performance across multiple tasks.

+ ## ➩ Classification (Tech Report)
  <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/Classifier.png" alt="Classification" width="80%">
  </div>

+ ## ➩ Regression (Tech Report)
  <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/Regression.png" alt="Regression" width="60%">
  </div>

+ ## ➩ Missing Value Imputation (Tech Report)
  <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/MissingValueImputation.png" alt="Missing value imputation" width="80%">
  </div>

  # ➤ Tutorials
 
  ```
  Install Python dependencies
  ```bash
+ # requires a Python 3.12.7 environment
+ pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
  pip install flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
+ pip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost kditransform hyperopt
  ```
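
Optionally, a quick import check (illustrative, not part of the official instructions; it assumes the pinned stack above installed cleanly):

```python
# verify the pinned stack is importable and CUDA is visible
import torch
import flash_attn  # noqa: F401
print(torch.__version__, torch.cuda.is_available())
```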

  ### Download source code
 
  | Model size | Download link | Tasks supported |
  | --- | --- | --- |
  | LimiX-16M | [LimiX-16M.ckpt](https://huggingface.co/stableai-org/LimiX-16M/tree/main) | ✅ classification ✅ regression ✅ missing value imputation |
+ | LimiX-2M | [LimiX-2M.ckpt](https://huggingface.co/stableai-org/LimiX-2M/tree/main) | ✅ classification ✅ regression ✅ missing value imputation |
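
Either checkpoint can also be fetched programmatically; a short sketch for LimiX-2M (the repo id and filename are taken from the download link above, and the examples later in this README use the same pattern for LimiX-16M):

```python
from huggingface_hub import hf_hub_download

# download the checkpoint into ./cache and get its local path
ckpt_path = hf_hub_download(repo_id="stableai-org/LimiX-2M", filename="LimiX-2M.ckpt", local_dir="./cache")
print(ckpt_path)
```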

  ## ➩ Interface description
 
  | y_train | np.ndarray | The target variable of the training set |
  | x_test | np.ndarray | The input features of the test set |

+ ## ➩ Inference Configuration File Description
+ | Configuration File Name | Description | Difference |
+ | ------- | ---------- | ----- |
+ | cls_default_retrieval.json | Default **classification task** inference configuration file **with retrieval** | Better classification performance |
+ | cls_default_noretrieval.json | Default **classification task** inference configuration file **without retrieval** | Faster speed, lower memory requirements |
+ | reg_default_retrieval.json | Default **regression task** inference configuration file **with retrieval** | Better regression performance |
+ | reg_default_noretrieval.json | Default **regression task** inference configuration file **without retrieval** | Faster speed, lower memory requirements |
+ | reg_default_noretrieval_MVI.json | Default inference configuration file for the **missing value imputation task** | |
+
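Each configuration file is passed to the predictor via `inference_config`. A small sketch of the trade-off (illustrative; `LimiXPredictor` and `model_file` are defined as in the classification example further below):

```python
# retrieval config favors accuracy; no-retrieval favors speed and lower memory
prefer_accuracy = True  # hypothetical switch
config = ('config/cls_default_retrieval.json' if prefer_accuracy
          else 'config/cls_default_noretrieval.json')
clf = LimiXPredictor(device='cuda', model_path=model_file, inference_config=config)
```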
  ## ➩ Ensemble Inference Based on Sample Retrieval

  For a detailed technical introduction to Ensemble Inference Based on Sample Retrieval, please refer to the [technical report](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).

+ Considering inference speed and memory requirements, ensemble inference based on sample retrieval currently requires hardware with specifications at or above the NVIDIA RTX 4090 GPU.

  ### Classification Task

  ```
+ python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
  ```

  ### Regression Task

  ```
+ python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
  ```

  ### Customizing Data Preprocessing for Inference Tasks
 
  #### Single GPU or CPU

  ```
+ python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
  ```

  #### Multi-GPU Distributed Inference

  ```
+ torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
  ```

  ### Regression Task
  #### Single GPU or CPU

  ```
+ python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
  ```

  #### Multi-GPU Distributed Inference

  ```
+ torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
+ ```
+
+ ### Retrieval Optimization Project
+ LimiX also implements an optimized retrieval system. To achieve the best performance, Optuna is used for hyperparameter tuning of the retrieval parameters.
+ #### Installation
+ Ensure the required dependency is installed:
+ ```
+ pip install optuna
+ ```
+ #### Usage
+ To search for optimized retrieval parameters on your own dataset, refer to the code below:
  ```
+ searchInference = RetrievalSearchHyperparameters(
+     dict(device_id=0, model_path=model_path), X_train, y_train, X_test, y_test,
+ )
+ config, result = searchInference.search(n_trials=10, metric="AUC",
+                                         inference_config='config/cls_default_retrieval.json', task_type="cls")
+ ```
+ This will launch an Optuna study to find the best combination of retrieval parameters for your specific dataset and use case.
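
A natural follow-up is to run inference with the tuned settings (a sketch, assuming the returned `config` is accepted wherever `inference_config` is passed; the repository's examples are the authoritative reference):

```python
# hypothetical: reuse the tuned retrieval configuration for prediction
clf = LimiXPredictor(device='cuda', model_path=model_path, inference_config=config)
prediction = clf.predict(X_train, y_train, X_test)
```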

  ## ➩ Classification
  ```python
 
  import numpy as np
  import os, sys

+ # single-process defaults so torch.distributed can initialize without torchrun
+ os.environ["RANK"] = "0"
+ os.environ["WORLD_SIZE"] = "1"
+ os.environ["MASTER_ADDR"] = "127.0.0.1"
+ os.environ["MASTER_PORT"] = "29500"
+
  ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
  if ROOT_DIR not in sys.path:
      sys.path.insert(0, ROOT_DIR)

  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

+ model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")

+ clf = LimiXPredictor(device='cuda', model_path=model_file, inference_config='config/cls_default_retrieval.json')
  prediction = clf.predict(X_train, y_train, X_test)

  print("roc_auc_score:", roc_auc_score(y_test, prediction[:, 1]))
 
      from sklearn.metrics import root_mean_squared_error as mean_squared_error
  except:
      from sklearn.metrics import mean_squared_error
+     mean_squared_error = partial(mean_squared_error, squared=False)
  import os, sys
+
+ # single-process defaults so torch.distributed can initialize without torchrun
+ os.environ["RANK"] = "0"
+ os.environ["WORLD_SIZE"] = "1"
+ os.environ["MASTER_ADDR"] = "127.0.0.1"
+ os.environ["MASTER_PORT"] = "29500"
+
  ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
  if ROOT_DIR not in sys.path:
      sys.path.insert(0, ROOT_DIR)

  y_train_normalized = (y_train - y_mean) / y_std
  y_test_normalized = (y_test - y_mean) / y_std

+ model_path = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")

+ model = LimiXPredictor(device='cuda', model_path=model_path, inference_config='config/reg_default_retrieval.json')
  y_pred = model.predict(X_train, y_train_normalized, X_test)

  # Compute RMSE and R²

  print(f'RMSE: {rmse}')
  print(f'R2: {r2}')
  ```
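
Because the targets were standardized before fitting, predictions can be mapped back to the original scale; a small follow-up sketch using the variables from the example above:

```python
# undo the target standardization and report RMSE on the original scale
y_pred_original = y_pred * y_std + y_mean
print('RMSE (original scale):', mean_squared_error(y_test, y_pred_original))
```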
+ For additional examples, refer to [inference_regression.py](https://github.com/limix-ldm/LimiX/blob/main/inference_regression.py)

  ## ➩ Missing value imputation
+ For the demo file, see [demo_missing_value_imputation.py](https://github.com/limix-ldm/LimiX/blob/main/examples/demo_missing_value_imputation.py)
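
The demo script is the authoritative reference. As a rough, hypothetical sketch of the idea, the column containing missing entries can be treated as the prediction target and filled from the complete columns, using the MVI configuration listed above (`LimiXPredictor` and `model_path` as in the regression example):

```python
import numpy as np

# hypothetical sketch: fill NaNs in column `col` of a feature matrix X
col = 3
mask = np.isnan(X[:, col])
imputer = LimiXPredictor(device='cuda', model_path=model_path,
                         inference_config='config/reg_default_noretrieval_MVI.json')
X[mask, col] = imputer.predict(np.delete(X[~mask], col, axis=1), X[~mask, col],
                               np.delete(X[mask], col, axis=1))
```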
 
  # ➤ Link
+ - LimiX paper: [LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505)
  - LimiX Technical Report: [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf)
+ - Detailed instructions for using LimiX: [the official LimiX documentation](https://www.limix.ai/doc/)
  - Balance Comprehensive Challenging Omni-domain Classification Benchmark: [bcco_cls](https://huggingface.co/datasets/stableai-org/bcco_cls)
  - Balance Comprehensive Challenging Omni-domain Regression Benchmark: [bcco_reg](https://huggingface.co/datasets/stableai-org/bcco_reg)

  # ➤ License
  The code in this repository is open-sourced under the [Apache-2.0](LICENSE.txt) license, while the usage of the LimiX model weights is subject to the Model License. The LimiX weights are fully available for academic research and may be used commercially upon obtaining proper authorization.

+ # ➤ Citation
+ ```
+ @article{LimiX,
+   title={LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence},
+   author={LimiX Team},
+   journal={arXiv preprint arXiv:2509.03505},
+   year={2025}
+ }
+ ```