AutoXLA - Accelerating Large Models on TPU
AutoXLA is an experimental library that automates the distribution, optimization, and quantization of large language models for TPUs using PyTorch/XLA. It extends the Hugging Face Transformers interface with TPU-aware features such as automatic sharding, custom attention kernels, and quantization-aware loading, making large-scale deployment and training both simpler and faster.
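For context on what "automatic sharding" replaces, here is a minimal sketch of the manual SPMD sharding you would otherwise write with plain PyTorch/XLA. The mesh layout and tensor shapes are illustrative assumptions, and AutoXLA's own sharding rules may differ:

```python
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

xr.use_spmd()  # enable PyTorch/XLA's SPMD execution mode

# Build a 2D (data, model) device mesh; assumes an even TPU device count.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices // 2, 2), ("data", "model"))

# Hand-shard a single weight matrix: rows replicated, columns split across
# the 'model' axis. AutoXLA applies rules like this across every layer of a
# Transformers model automatically.
weight = torch.randn(4096, 4096, device=xm.xla_device())
xs.mark_sharding(weight, mesh, (None, "model"))
```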
With quantization and Splash Attention kernels, AutoXLA achieves up to 4× speedups over standard Flash Attention implementations, significantly improving throughput for both inference and training workloads.
Whether you're experimenting with distributed setups (FSDP, 2D, or 3D sharding) or optimizing memory via LanguageModelQuantizer, AutoXLA is built to make scaling LLMs on TPU seamless.
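The post names LanguageModelQuantizer but not its API, so rather than guess at its signature, here is a generic sketch of the kind of per-channel int8 weight quantization such a loader performs. The quantize_int8 helper below is our own illustration, not AutoXLA code:

```python
import torch

# Generic per-channel symmetric int8 quantization (illustrative only;
# AutoXLA's actual quantization scheme may differ).
def quantize_int8(w: torch.Tensor):
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # one scale per output row
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)          # fp32 weights: 64 MiB
q, scale = quantize_int8(w)          # int8 weights: 16 MiB (+ 16 KiB of scales)
w_hat = q.to(torch.float32) * scale  # dequantize for use in a matmul
print((w - w_hat).abs().max())       # reconstruction error stays small
```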
⚠️ Note: This is an experimental repository. Expect rough edges! Please report bugs or unexpected behavior through GitHub issues.
GitHub Repository: https://github.com/Locutusque/AutoXLA