AAAI 2026 (Oral)

TraveLLaMA: A Multimodal Travel Assistant with
Large-Scale Dataset and Structured Reasoning

Hong Kong University of Science and Technology · Chinese University of Hong Kong · Shanghai AI Laboratory
TraveLLaMA Overview

TraveLLaMA is a multimodal AI travel assistant that processes both text and image-based queries. The system helps travelers plan trips efficiently by providing contextual responses, including service information, location details, and personalized recommendations grounded in visual inputs and textual questions.

Abstract

Tourism and travel planning increasingly rely on digital assistance, yet existing multimodal AI systems often lack specialized knowledge and contextual understanding of urban environments. We present TraveLLaMA, a specialized multimodal language model designed for comprehensive travel assistance. Our work addresses the fundamental challenge of developing practical AI travel assistants through three key contributions:

(1) TravelQA, a novel dataset of 265k question-answer pairs combining 160k text QA from authentic travel sources, 100k vision-language QA featuring maps and location imagery, and 5k expert-annotated Chain-of-Thought reasoning examples; (2) Travel-CoT, a structured reasoning framework that decomposes travel queries into spatial, temporal, and practical dimensions, improving answer accuracy by 10.8% while providing interpretable decision paths; and (3) an interactive agent system validated through extensive user studies.

Through fine-tuning experiments on state-of-the-art vision-language models, we achieve 6.2-9.4% base improvements, further enhanced by Travel-CoT reasoning. User studies with 500 participants show TraveLLaMA achieves a System Usability Scale score of 82.5, significantly outperforming general-purpose models.

TravelQA Dataset

The first large-scale multimodal travel dataset spanning 35+ cities worldwide

  • 📊 265K Total QA Pairs
  • 📝 160K Text-based QA
  • 🖼️ 100K Vision-Language QA
  • 🧠 5K Expert CoT Annotations
TravelQA Dataset Overview
TravelQA Dataset Coverage: Our dataset spans major cities across North America, Asia, and Europe, covering six structured categories: Attractions (70k), Dining (52k), Living (39k), Transportation (26k), Cultural (39k), and Practical (34k) information. Visual elements include 40k map-based and 60k street-view image QA pairs.
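
To make the composition concrete, here is a minimal sketch of what a single TravelQA record could look like. The field names and types are our own illustrative assumptions, not the released schema.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TravelQARecord:
    question: str                          # traveler query
    answer: str                            # grounded answer text
    category: str                          # attractions / dining / living /
                                           # transportation / cultural / practical
    city: str                              # one of the 35+ covered cities
    image_path: Optional[str] = None       # map or street-view image; None for text-only QA
    cot_steps: Optional[List[str]] = None  # expert reasoning chain (5k CoT subset only)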

Key Contributions

Three major innovations for advancing AI-powered travel assistance

TravelQA Dataset

The first large-scale multimodal travel dataset with 265k QA pairs: 160k text-based pairs from travel forums, 100k vision-language pairs with maps and photos, and 5k expert-annotated CoT reasoning examples across 35+ cities.

Travel-CoT Reasoning

A structured reasoning framework that decomposes queries into spatial, temporal, and practical dimensions. Beyond 6.2-9.4% base improvements, Travel-CoT achieves an additional 10.8% accuracy gain with interpretable reasoning paths.

Interactive Agent System

A ReAct-based agent that integrates real-time services for dynamic planning. Validated by 500 users with a SUS score of 82.5 (Excellent), demonstrating superior usability for complex travel planning tasks.

Travel-CoT: Structured Reasoning

Decomposing travel queries into interpretable reasoning dimensions

📍 Spatial Reasoning
⏰ Temporal Scheduling
💡 Practical Constraints
Travel-CoT Method
Travel-CoT Framework: Given multimodal input (x, Q), the model generates a reasoning chain r = {rs, rt, rp}, where rs encodes spatial understanding (locations, distances, routes), rt encodes temporal scheduling (operating hours, time allocation), and rp captures practical constraints (budget, accessibility, safety). The final answer is generated conditioned on both the input and the reasoning chain.

Agent Architecture

Real-time planning with iterative reasoning and tool integration

Agent Architecture
ReAct-Style Agent Pipeline: The agent processes multimodal travel requests through four stages: (1) Query Analysis extracts constraints and interprets visual inputs; (2) Reasoning applies Travel-CoT to organize requirements; (3) Tool Employment calls APIs for real-time information; (4) Result Integration generates detailed itineraries with budget calculations and constraint verification.
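
A compact sketch of the four-stage loop, again under illustrative assumptions: llm is a text-generation call taking an optional image, tools is a dict of API wrappers (maps, weather, booking), and the naive action parser stands in for structured tool-calling. None of these names come from the paper's code.

def parse_action(text):
    # Naive "tool: arg=value ..." parser; a production agent would rely on
    # structured (e.g. JSON) tool-calling output instead.
    tool, _, arg_str = text.partition(":")
    return tool.strip(), dict(kv.split("=", 1) for kv in arg_str.split() if "=" in kv)

def plan_trip(llm, tools, request, image=None):
    # (1) Query Analysis: extract constraints and interpret visual input.
    constraints = llm(f"Extract travel constraints (dates, budget, party size): {request}", image)

    # (2) Reasoning: organize requirements with Travel-CoT.
    outline = llm(f"Decompose into spatial/temporal/practical steps: {constraints}")

    # (3) Tool Employment: iterate thought -> action -> observation.
    observations = []
    for step in outline.splitlines():
        action = llm(f"Choose a tool (maps, weather, booking) and arguments for: {step}")
        tool_name, args = parse_action(action)
        observations.append(tools[tool_name](**args))   # real-time API call

    # (4) Result Integration: itinerary with budget and constraint checks.
    return llm(f"Write an itinerary satisfying {constraints}, verified against: {observations}")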

Experimental Results

Comprehensive evaluation on TravelQA benchmark

| Method | LLM Backbone | Pure Text | VQA | Full Score | Δ Improvement |
|---|---|---|---|---|---|
| BLIP-2 | Vicuna-13B | 60.3 | 51.6 | 56.9 | – |
| InstructBLIP | Vicuna-13B | 64.6 | 55.4 | 61.1 | – |
| Shikra | Vicuna-13B | 71.6 | 60.8 | 67.5 | – |
| Qwen-VL | Qwen-7B | 72.1 | 61.6 | 68.1 | – |
| LLaVA-1.5 | Vicuna-13B | 74.3 | 63.3 | 70.0 | – |
| Fine-tuned on TravelQA | | | | | |
| Qwen-VL (ft) | Qwen-7B | 78.7 | 67.7 | 74.5 | +9.4% |
| LLaVA-1.5 (ft) | Vicuna-13B | 80.4 | 68.9 | 76.0 | +8.6% |
| TraveLLaMA (Ours) | Vicuna-13B | 82.5 | 70.5 | 77.8 | +10.8% |

User Study Results

System Usability Scale (SUS) evaluation with 500 participants

| System | SUS Score | Rating |
|---|---|---|
| TraveLLaMA (Ours) | 82.5 | ⭐ Excellent |
| Claude 3.5 | 76.3 | Good |

Our user study demonstrates that TraveLLaMA significantly outperforms general-purpose models in travel planning tasks. The 6.2-point SUS improvement is driven by domain-optimized design, reflected in strong ease-of-use, learnability, and reduced-complexity ratings. Users consistently found TraveLLaMA more intuitive and less cognitively demanding for travel planning.
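
For context, SUS is the standard ten-item usability questionnaire scored on a 0-100 scale. Below is a sketch of the conventional scoring rule (Brooke, 1996), which we assume the study follows; it is the standard formula, not code from the paper.

def sus_score(responses):
    # Ten Likert responses in [1, 5]; odd-numbered items contribute
    # (score - 1), even-numbered items (5 - score), scaled by 2.5 to 0-100.
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)   # i = 0 is item 1 (odd)
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([5, 1, 5, 2, 4, 1, 5, 1, 4, 2]))  # -> 90.0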

Qualitative Comparison

TraveLLaMA vs. Claude 3.5 on real-world travel queries

Qualitative Comparison

Key Observations: TraveLLaMA consistently outperforms Claude 3.5 in accuracy and contextual understanding:

  • Map-based tasks: Precise location grounding and relevant nearby information
  • Scene recognition: Accurate identification of landmarks with meaningful contextual descriptions
  • Multimodal integration: Detailed, actionable establishment information
  • Complex queries: Proactive travel guidance including nearby attractions and optimal visiting times

📚 Citation

@inproceedings{chu2026travellama,
  title     = {TraveLLaMA: A Multimodal Travel Assistant with Large-Scale Dataset and Structured Reasoning},
  author    = {Chu, Meng and Chen, Yukang and Gui, Haokun and Yu, Shaozuo and Wang, Yi and Jia, Jiaya},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026}
}