AAAI 2026 (Oral)

TraveLLaMA: A Multimodal Travel Assistant with
Large-Scale Dataset and Structured Reasoning

Hong Kong University of Science and Technology · Chinese University of Hong Kong · Shanghai AI Laboratory
TraveLLaMA Overview

TraveLLaMA is a multimodal AI travel assistant that processes both text and image-based queries. The system helps travelers plan trips efficiently by providing contextual responses, including service information, location details, and personalized recommendations grounded in visual inputs and textual questions.

Abstract

Tourism and travel planning increasingly rely on digital assistance, yet existing multimodal AI systems often lack specialized knowledge and contextual understanding of urban environments. We present TraveLLaMA, a specialized multimodal language model designed for comprehensive travel assistance. Our work addresses the fundamental challenge of developing practical AI travel assistants through three key contributions:

(1) TravelQA, a novel dataset of 265k question-answer pairs combining 160k text QA from authentic travel sources, 100k vision-language QA featuring maps and location imagery, and 5k expert-annotated Chain-of-Thought reasoning examples; (2) Travel-CoT, a structured reasoning framework that decomposes travel queries into spatial, temporal, and practical dimensions, improving answer accuracy by 10.8% while providing interpretable decision paths; and (3) an interactive agent system validated through extensive user studies.

Through fine-tuning experiments on state-of-the-art vision-language models, we achieve 6.2-9.4% base improvements, further enhanced by Travel-CoT reasoning. User studies with 500 participants show TraveLLaMA achieves a System Usability Scale score of 82.5, significantly outperforming general-purpose models.

TravelQA Dataset

The first large-scale multimodal travel dataset spanning 35+ cities worldwide

  • 📊 265K Total QA Pairs
  • 📝 160K Text-based QA
  • 🖼️ 100K Vision-Language QA
  • 🧠 5K Expert CoT Annotations
TravelQA Dataset Overview
TravelQA Dataset Coverage: Our dataset spans major cities across North America, Asia, and Europe, covering six structured categories: Attractions (70k), Dining (52k), Living (39k), Transportation (26k), Cultural (39k), and Practical (34k) information. Visual elements include 40k map-based and 60k street-view image QA pairs.
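
To make the composition concrete, here is a minimal sketch of what a single TravelQA record could look like. The field names and types are our own illustrative assumptions, not the released schema.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TravelQARecord:
    question: str                          # traveler query
    answer: str                            # grounded answer text
    category: str                          # attractions / dining / living /
                                           # transportation / cultural / practical
    city: str                              # one of the 35+ covered cities
    image_path: Optional[str] = None       # map or street-view image; None for text-only QA
    cot_steps: Optional[List[str]] = None  # expert reasoning chain (5k CoT subset only)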

Key Contributions

Three major innovations for advancing AI-powered travel assistance

TravelQA Dataset

The first large-scale multimodal travel dataset with 265k QA pairs: 160k text-based pairs from travel forums, 100k vision-language pairs with maps and photos, and 5k expert-annotated CoT reasoning examples across 35+ cities.

Travel-CoT Reasoning

A structured reasoning framework that decomposes queries into spatial, temporal, and practical dimensions. Beyond 6.2-9.4% base improvements, Travel-CoT achieves an additional 10.8% accuracy gain with interpretable reasoning paths.

Interactive Agent System

A ReAct-based agent that integrates real-time services for dynamic planning. Validated by 500 users with a SUS score of 82.5 (Excellent), demonstrating superior usability for complex travel planning tasks.

Travel-CoT: Structured Reasoning

Decomposing travel queries into interpretable reasoning dimensions

📍 Spatial Reasoning
⏰ Temporal Scheduling
💡 Practical Constraints
Travel-CoT Method
Travel-CoT Framework: Given multimodal input (x, Q), the model generates a reasoning chain r = {rs, rt, rp}, where rs encodes spatial understanding (locations, distances, routes), rt encodes temporal scheduling (operating hours, time allocation), and rp captures practical constraints (budget, accessibility, safety). The final answer is generated conditioned on both the input and the reasoning chain.

Agent Architecture

Real-time planning with iterative reasoning and tool integration

Agent Architecture
ReAct-Style Agent Pipeline: The agent processes multimodal travel requests through four stages: (1) Query Analysis extracts constraints and interprets visual inputs; (2) Reasoning applies Travel-CoT to organize requirements; (3) Tool Employment calls APIs for real-time information; (4) Result Integration generates detailed itineraries with budget calculations and constraint verification.
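
A compact sketch of the four-stage loop, again under illustrative assumptions: llm is a text-generation call taking an optional image, tools is a dict of API wrappers (maps, weather, booking), and the naive action parser stands in for structured tool-calling. None of these names come from the paper's code.

def parse_action(text):
    # Naive "tool: arg=value ..." parser; a production agent would rely on
    # structured (e.g. JSON) tool-calling output instead.
    tool, _, arg_str = text.partition(":")
    return tool.strip(), dict(kv.split("=", 1) for kv in arg_str.split() if "=" in kv)

def plan_trip(llm, tools, request, image=None):
    # (1) Query Analysis: extract constraints and interpret visual input.
    constraints = llm(f"Extract travel constraints (dates, budget, party size): {request}", image)

    # (2) Reasoning: organize requirements with Travel-CoT.
    outline = llm(f"Decompose into spatial/temporal/practical steps: {constraints}")

    # (3) Tool Employment: iterate thought -> action -> observation.
    observations = []
    for step in outline.splitlines():
        action = llm(f"Choose a tool (maps, weather, booking) and arguments for: {step}")
        tool_name, args = parse_action(action)
        observations.append(tools[tool_name](**args))   # real-time API call

    # (4) Result Integration: itinerary with budget and constraint checks.
    return llm(f"Write an itinerary satisfying {constraints}, verified against: {observations}")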

Experimental Results

Comprehensive evaluation on TravelQA benchmark

| Method | LLM Backbone | Pure Text | VQA | Full Score | Δ Improvement |
|---|---|---|---|---|---|
| BLIP-2 | Vicuna-13B | 60.3 | 51.6 | 56.9 | – |
| InstructBLIP | Vicuna-13B | 64.6 | 55.4 | 61.1 | – |
| Shikra | Vicuna-13B | 71.6 | 60.8 | 67.5 | – |
| Qwen-VL | Qwen-7B | 72.1 | 61.6 | 68.1 | – |
| LLaVA-1.5 | Vicuna-13B | 74.3 | 63.3 | 70.0 | – |
| Fine-tuned on TravelQA | | | | | |
| Qwen-VL (ft) | Qwen-7B | 78.7 | 67.7 | 74.5 | +9.4% |
| LLaVA-1.5 (ft) | Vicuna-13B | 80.4 | 68.9 | 76.0 | +8.6% |
| TraveLLaMA (Ours) | Vicuna-13B | 82.5 | 70.5 | 77.8 | +10.8% |

User Study Results

System Usability Scale (SUS) evaluation with 500 participants

| System | SUS Score | Rating |
|---|---|---|
| TraveLLaMA (Ours) | 82.5 | ⭐ Excellent |
| Claude 3.5 | 76.3 | Good |

Our user study demonstrates that TraveLLaMA significantly outperforms general-purpose models in travel planning tasks. The 6.2-point SUS improvement is driven by domain-optimized design, reflected in strong ease-of-use, learnability, and reduced-complexity ratings. Users consistently found TraveLLaMA more intuitive and less cognitively demanding for travel planning.
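
For context, SUS is the standard ten-item usability questionnaire scored on a 0-100 scale. Below is a sketch of the conventional scoring rule (Brooke, 1996), which we assume the study follows; it is the standard formula, not code from the paper.

def sus_score(responses):
    # Ten Likert responses in [1, 5]; odd-numbered items contribute
    # (score - 1), even-numbered items (5 - score), scaled by 2.5 to 0-100.
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)   # i = 0 is item 1 (odd)
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([5, 1, 5, 2, 4, 1, 5, 1, 4, 2]))  # -> 90.0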

Qualitative Comparison

TraveLLaMA vs. Claude 3.5 on real-world travel queries

Qualitative Comparison

Key Observations: TraveLLaMA consistently outperforms Claude 3.5 in accuracy and contextual understanding:

  • Map-based tasks: Precise location grounding and relevant nearby information
  • Scene recognition: Accurate identification of landmarks with meaningful contextual descriptions
  • Multimodal integration: Detailed, actionable establishment information
  • Complex queries: Proactive travel guidance including nearby attractions and optimal visiting times

📚 Citation

@inproceedings{chu2026travellama,
  title     = {TraveLLaMA: A Multimodal Travel Assistant with Large-Scale Dataset and Structured Reasoning},
  author    = {Chu, Meng and Chen, Yukang and Gui, Haokun and Yu, Shaozuo and Wang, Yi and Jia, Jiaya},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026}
}