Update README.md
README.md CHANGED
@@ -39,7 +39,10 @@ through fine-tuning, but there is limited work targeting spatial reasoning.
## Main Results
+ The fine-tuned model slightly improved on general knowledge tasks such as MMLU Geography and bAbI Task 17 compared to the original Mistral-7B base model. However, its performance on spatial reasoning benchmarks such as SpatialEval declined significantly, suggesting that fine-tuning may have introduced an incompatibility between the prompt style used for training on StepGame and the multiple-choice formatting of SpatialEval.
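
To make the hypothesized prompt-style mismatch concrete, here is a minimal illustrative sketch. The prompts and the scoring function below are simplified stand-ins (not the actual StepGame or SpatialEval items, and not the evaluation code used here): a model fine-tuned to answer with bare StepGame relation labels can be marked wrong by naive option-letter matching on multiple-choice questions, even when the underlying spatial relation is correct.

```python
# Illustrative sketch only: simplified stand-ins for the two prompt styles,
# not the actual benchmark items or the evaluation pipeline used in this work.

# StepGame-style fine-tuning example: the target is a bare relation label.
stepgame_prompt = (
    "Story: A is to the left of B. B is above C.\n"
    "Question: What is the relation of A to C?\n"
    "Answer:"
)
stepgame_target = "upper-left"

# SpatialEval-style multiple-choice item: the expected answer is an option letter.
spatialeval_prompt = (
    "Question: Which object is directly above the circle?\n"
    "Options: (A) square (B) triangle (C) star (D) none of the above\n"
    "Answer:"
)

def option_letter_match(model_output: str, gold_letter: str) -> bool:
    """Naive multiple-choice scoring: the reply must be exactly the option letter."""
    return model_output.strip().upper() == gold_letter.upper()

# A model that keeps answering in StepGame style is scored as incorrect under
# this kind of matching, even if the described relation happens to be right.
print(option_letter_match("A", "A"))           # True  (option-letter reply)
print(option_letter_match("upper-left", "A"))  # False (relation-word reply)
```
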
# Training Data
@@ -108,7 +111,9 @@ fine-tuning task.
## Comparison Models
+ LLaMA-2 and Gemma represent strong alternatives from Meta and Google, respectively, offering diverse architectural approaches with a similar number of parameters and training data sources. Including these models allowed for a more meaningful evaluation of how my fine-tuned model performs, not just against its own baseline but also against state-of-the-art peers on spatial reasoning and general knowledge tasks.
# Usage and Intended Uses
This model is designed to assist with natural language spatial reasoning, particularly in tasks that involve multi-step relational