Update README.md
README.md
This is a large language model that was released by Meta on 2024-07-23.
As of its release date, this is the largest and most complex open
weights model available. This is the base model: it hasn't been
fine-tuned to follow your instructions. See also
[Meta-Llama-3.1-405B-Instruct-llamafile](https://huggingface.co/Mozilla/Meta-Llama-3.1-405B-Instruct-llamafile)
for a friendlier and more useful version of this model.

- Model creator: [Meta](https://huggingface.co/meta-llama/)
- Original model: [meta-llama/Meta-Llama-3.1-405B](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B)

## Quickstart

Running the following on a desktop OS will launch a tab in your web
browser. The smallest weights available are Q2\_K, which should work
fine on systems with at least 150 GB of RAM. Due to Hugging Face's 50 GB
upload limit, this llamafile has to be downloaded in multiple files and
then concatenated back together locally, so you'll need at least 400 GB
of free disk space.

```
wget https://huggingface.co/Mozilla/Meta-Llama-3.1-405B-llamafile/resolve/main/Meta-Llama-3.1-405B.Q2_K.cat0.llamafile
wget https://huggingface.co/Mozilla/Meta-Llama-3.1-405B-llamafile/resolve/main/Meta-Llama-3.1-405B.Q2_K.cat1.llamafile
wget https://huggingface.co/Mozilla/Meta-Llama-3.1-405B-llamafile/resolve/main/Meta-Llama-3.1-405B.Q2_K.cat2.llamafile
wget https://huggingface.co/Mozilla/Meta-Llama-3.1-405B-llamafile/resolve/main/Meta-Llama-3.1-405B.Q2_K.cat3.llamafile
cat Meta-Llama-3.1-405B.Q2_K.cat{0,1,2,3}.llamafile >Meta-Llama-3.1-405B.Q2_K.llamafile
rm Meta-Llama-3.1-405B.Q2_K.cat*.llamafile
chmod +x Meta-Llama-3.1-405B.Q2_K.llamafile
./Meta-Llama-3.1-405B.Q2_K.llamafile
```

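The parts must be joined byte-for-byte, in order, for the combined file
to work. If you want to convince yourself of what the `cat` step does
before committing hundreds of gigabytes, you can try the same
brace-expansion pattern on tiny stand-in files (the `demo.*` names below
are made up for illustration):

```
# Create four tiny stand-in parts (hypothetical demo files, not the
# real ~50 GB llamafile parts above).
printf 'part0' > demo.cat0.bin
printf 'part1' > demo.cat1.bin
printf 'part2' > demo.cat2.bin
printf 'part3' > demo.cat3.bin

# Same pattern as the real command: bash expands {0,1,2,3} in order,
# so the parts are concatenated in sequence.
cat demo.cat{0,1,2,3}.bin > demo.bin

cat demo.bin   # prints: part0part1part2part3

# Clean up the demo files.
rm demo.cat*.bin demo.bin
```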
You can then use the completion mode of the GUI to experiment with this
model. You can prompt the model for completions on the command line too:

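For example (a sketch, assuming the llama.cpp-style `-p` and `-n` flags
that llamafiles accept; the prompt text here is made up):

```
# Hypothetical invocation: -p supplies the text to complete,
# -n caps the number of tokens generated.
./Meta-Llama-3.1-405B.Q2_K.llamafile -p 'The capital of France is' -n 32
```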
This model has a max context window size of 128k tokens. By default, a
context window size of 4096 tokens is used. You can use a larger context
window by passing the `-c 8192` flag. The software currently has
limitations in its llama v3.1 support that may prevent scaling to the
full 128k size.

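For instance, to launch with the larger context window (a sketch; `-c`
is the flag named above, applied to the Q2\_K build from the
Quickstart):

```
# Launch with an 8192-token context window instead of the 4096 default.
./Meta-Llama-3.1-405B.Q2_K.llamafile -c 8192
```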
On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card