---
license: apache-2.0
base_model:
  - mistralai/Mistral-7B-Instruct-v0.3
---

## Introduction

### Task Description

This model was fine-tuned to improve its ability to perform spatial reasoning tasks. The objective is to enable the model to interpret natural-language queries about spatial relationships, directions, and locations, and to output actionable responses. The task addresses a limitation of current LLMs, which often fail at precise spatial reasoning, such as determining relationships between points on a map, planning routes, or identifying locations from bounding boxes.

### Task Importance

Spatial reasoning is important for a wide range of applications such as navigation and geospatial analysis. Many smaller LLMs, while strong in general reasoning, lack the ability to interpret spatial relationships precisely or to use real-world geographic data effectively. For example, they struggle with queries like “What’s between Point A and Point B?” or “Find me the fastest route avoiding traffic at 8 AM tomorrow.” I encountered this limitation in my own work on prompt engineering for an LLM project with agentic behavior that calls a geocoding API. Even with access to geospatial information, smaller models struggled to interpret user questions correctly, so we had to switch to a much newer and larger model.

### Related Work/Gap Analysis

While there is ongoing research on integrating LLMs with geospatial systems, most existing solutions rely on symbolic AI or rule-based systems rather than leveraging the generalization capabilities of LLMs. The paper “Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark,” which used StepGame as its spatial reasoning benchmark, concluded that larger models like GPT-4 map natural-language descriptions to spatial relations well but struggle with multi-hop reasoning. Fine-tuning a model fills the gap identified in that paper, since the only solution its authors explored was prompt engineering with chain-of-thought.

Research by organizations like OpenAI and Google has focused on improving contextual reasoning through fine-tuning, but there is limited work targeting spatial reasoning.

## Main Results

### Training Data

For this fine-tuning task, the StepGame dataset was used. The dataset is large and provides multi-step spatial reasoning challenges. The train-test split is predefined, with 50,000 rows in the train split and 10,000 in the test split. It focuses on multi-step problem solving with spatial relationships, such as directional logic, relative positioning, and route-based reasoning, and presents text-based tasks that require stepwise deductions, ensuring the model develops strong reasoning abilities beyond simple fact recall. Each example follows a story, question, and answer template, as depicted below.

*Figure: Description of StepGame training data (story, question, and answer examples).*
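
A minimal sketch of inspecting the data with the Hugging Face `datasets` library; the Hub ID `tasksource/stepgame` and the field names below are assumptions for illustration and may differ from the actual StepGame release:

```python
# Sketch: load a StepGame-style dataset and inspect one record.
# The Hub ID and column names below are illustrative assumptions.
from datasets import load_dataset

dataset = load_dataset("tasksource/stepgame")  # hypothetical dataset ID
print(dataset["train"][0])
# A record follows the story/question/answer template described above, e.g.:
# {"story": "X is to the left of Y. ...",
#  "question": "Where is X relative to Y?",
#  "answer": "left"}
```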

### Training Method

For this spatial reasoning task, LoRA (Low-Rank Adaptation) was used as the training method. LoRA allows efficient fine-tuning of large language models by freezing the majority of the model weights and updating only small, low-rank adapter matrices within the attention layers. It significantly reduces the computational cost and memory requirements of full fine-tuning, making it ideal for limited GPU resources. LoRA is especially effective for task-specific adaptation when the dataset is moderately sized and the instruction formatting is consistent, as is the case with StepGame. In previous experiments with spatial reasoning fine-tuning, LoRA performed better than prompt tuning: while prompt tuning resulted in close to 0% accuracy on both the StepGame and MMLU evaluations, LoRA preserved partial task performance (18% accuracy on StepGame) and retained some general knowledge ability (46% accuracy on MMLU geography vs. 52% before training). I used a learning rate of 2e-4, a batch size of 8, and trained for 2 epochs. This setup preserved general reasoning ability while improving spatial accuracy.
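
Below is a minimal configuration sketch of this setup using the `peft` library. The learning rate, batch size, and epoch count come from the description above; the LoRA rank, alpha, dropout, and target modules are assumptions for illustration:

```python
# Sketch of a LoRA fine-tuning configuration with transformers + peft.
# r, lora_alpha, lora_dropout, and target_modules are assumed values;
# learning rate, batch size, and epochs match the numbers reported above.
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

lora_config = LoraConfig(
    r=16,                                  # assumed low-rank dimension
    lora_alpha=32,                         # assumed scaling factor
    lora_dropout=0.05,                     # assumed adapter dropout
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)  # freezes base weights, adds adapters
model.print_trainable_parameters()         # only adapter matrices are trainable

training_args = TrainingArguments(
    output_dir="spatial_lora_mistral",
    learning_rate=2e-4,                    # from the description above
    per_device_train_batch_size=8,         # from the description above
    num_train_epochs=2,                    # from the description above
)
```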

### Evaluation
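
As reported under Training Method, the LoRA-tuned model reached 18% accuracy on the StepGame evaluation and 46% accuracy on MMLU geography, compared to 52% on MMLU geography for the base model before fine-tuning.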

### Usage and Intended Uses

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("sareena/spatial_lora_mistral")
tokenizer = AutoTokenizer.from_pretrained("sareena/spatial_lora_mistral")

# Pose a spatial reasoning question in the story-plus-question format
inputs = tokenizer(
    "Q: The couch is to the left of the table. The lamp is on the couch. "
    "Where is the lamp?",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
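
If the repository contains only LoRA adapter weights rather than fully merged model weights (an assumption; adjust to how the checkpoint was actually published), the adapter can instead be attached to the base model with `peft`:

```python
# Alternative loading path, assuming the repo holds only LoRA adapters.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
model = PeftModel.from_pretrained(base, "sareena/spatial_lora_mistral")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
```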

### Prompt Format
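
The exact prompt template is not documented here; based on the usage example above and the StepGame story/question/answer structure, an illustrative prompt (an assumption) looks like:

```
Q: The couch is to the left of the table. The lamp is on the couch. Where is the lamp?
```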

### Expected Output Format
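
StepGame answers are short spatial relations; an illustrative completion for the prompt above (an assumption, since the exact output format is not documented here):

```
A: on the couch, to the left of the table
```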

### Limitations
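
As the evaluation numbers above indicate, the fine-tuned model reaches only 18% accuracy on StepGame, and its MMLU geography accuracy dropped from 52% to 46% relative to the base model, so some general knowledge was traded for spatial ability. Outputs should therefore be verified before use in navigation or other geospatial applications.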

## Citation

1. Hendrycks, Dan, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. "Measuring Massive Multitask Language Understanding." arXiv preprint arXiv:2009.03300 (2020).
2. Li, Fangjun, et al. "Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark." arXiv preprint arXiv:2401.03991 (2024). https://arxiv.org/abs/2401.03991
3. Mirzaee, Roshanak, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. "SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning." arXiv preprint arXiv:2104.05832 (2021).
4. Shi, Zhengxiang, Qiang Zhang, and Aldo Lipani. "StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts." arXiv preprint arXiv:2204.08292 (2022). https://arxiv.org/abs/2204.08292
5. Wang, Mila, Xiang Lorraine Li, and William Yang Wang. "SpatialEval: A Benchmark for Spatial Reasoning Evaluation." arXiv preprint arXiv:2104.08635 (2021).
6. Weston, Jason, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks." arXiv preprint arXiv:1502.05698 (2015).