
Tutorial on Large Language Models for Recommendation

Wenyue Hua Lei Li Shuyuan Xu Li Chen Yongfeng Zhang


Rutgers University Hong Kong Baptist University Rutgers University Hong Kong Baptist University Rutgers University
[email protected] [email protected] [email protected] [email protected] [email protected]

1
Outline
• Background and Introduction
• Large Language Models for Recommendation
• Trustworthy LLMs for Recommendation
• Hands-on Demo of LLM-RecSys Development based on OpenP5
• Summary

2
Recommender Systems are Everywhere
• Influence our daily life by providing personalized services

E-commerce, Social Networks, News Feeds, Search Engines, Navigation, Travel Planning, Professional Networks, Healthcare, Online Education

3
Technical Advancement of Recommender Systems
• From Shallow Model, to Deep Model, and to Large Model
Shallow Models (e.g., Matrix Factorization [1]) → Deep Models (e.g., Wide & Deep NN [2]) → Large Models (e.g., P5 [3])

[1] Koren, Yehuda, Robert Bell, and Chris Volinsky. "Matrix factorization techniques for recommender systems." Computer 42, no. 8 (2009): 30-37.
[2] Cheng, Heng-Tze, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson et al. "Wide & deep learning for recommender systems." DLRS 2016.
4
[3] Geng, Shijie, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. "Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)." RecSys 2022.
Objective AI vs. Subjective AI
• Recommendation is unique in the AI family
• Among all AI tasks, recommendation is the closest to humans
• Recommendation is a very representative Subjective AI task
• This leads to many unique challenges in recommendation research

Objective AI (e.g., Computer Vision, much of NLP): relatively far from humans; problems have exact answers.
Subjective AI (e.g., Recommendation): very close to humans; problems have no absolute answers.

5
Computer Vision: (mostly) Objective AI Tasks
Computer Vision sits near the Objective AI end of the spectrum.
Example tasks: Image Classification, Image Segmentation, Object Detection.
(Example images: classifying cat vs. dog; a husky that looks like a wolf.)

6
NLP: partly Objective, partly Subjective
NLP spans the spectrum: tasks such as Word Segmentation and Syntactic Analysis are largely objective, while tasks such as Dialog Systems are largely subjective.

7
Recommendation: mostly Subjective AI Tasks
Recommendation sits near the Subjective AI end of the spectrum.
Example tasks: Movie Recommendation, Product Recommendation.

8
Recommendation is not only about Item Ranking
• A diverse set of recommendation tasks
• Rating Prediction
• Item Ranking
• Sequential Recommendation
• User Profile Construction
• Review Summarization
• Explanation Generation
• Fairness Consideration
• etc.

9
Example: Subjective AI needs Explainability
• Objective vs. Subjective AI on Explainability
Objective AI: humans can directly identify whether the AI-produced result is right or wrong (e.g., cat vs. dog classification).
Subjective AI: humans can hardly identify whether the AI-produced result is right or wrong; nothing is definitely right or wrong, results are highly subjective and usually personalized, so users are vulnerable and could be manipulated, exploited, or even cheated by the system.

10
Example: Subjective AI needs Explainability
• In many cases, it doesn’t matter what you recommend, but how you explain your recommendation
• How do humans make recommendations?

(Cartoon: a recommendation with no reason, "I recommend this movie!" "Why?" "No reason.", versus a recommendation with an explanation, "I recommend this movie because…" "Ah!")
11
Can we Handle all RecSys tasks Together?
• A diverse set of recommendation tasks
• Rating Prediction
• Item Ranking
• Sequential Recommendation
• User Profile Construction
• Review Summarization
• Explanation Generation
• Fairness Consideration
• etc.
• Do we really need to design thousands of recommendation models?
• Difficult to integrate so many models in industry production environment

12
A Bird’s View of Traditional RecSys
• The Multi-Stage Filtering RecSys Pipeline

YouTube recommendation engine

Image credit to [1] Image credit to [2]

13
[1] Jiang, Biye, Pengye Zhang, Rihan Chen, Xinchen Luo, Yin Yang, Guan Wang, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. "DCAF: A Dynamic Computation Allocation Framework for Online Serving System." DLP-KDD 2020.
[2] Covington, Paul, Jay Adams, and Emre Sargin. "Deep neural networks for youtube recommendations." In Proceedings of the 10th ACM conference on recommender systems, pp. 191-198. 2016.
Discriminative Ranking
• User-item matching based on embeddings

Image credit to [1]

Matching Models Sequential Models Reasoning Models

• Discriminative ranking loss function


• e.g., Bayesian Personalized Ranking (BPR) loss
maximize $\sum_{(u,i,j)} \ln \sigma(\hat{x}_{uij})$, where $\hat{x}_{uij} = p_u^\top q_i - p_u^\top q_j$
14
[1] Chen, Hanxiong, Shaoyun Shi, Yunqi Li, and Yongfeng Zhang. "Neural collaborative reasoning." In Proceedings of the Web Conference 2021, pp. 1516-1527. 2021.
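A minimal PyTorch sketch of the BPR objective above; the tensor names mirror the formula (p_u, q_i, q_j) and are illustrative rather than taken from the code of [1]:

import torch.nn.functional as F

def bpr_loss(p_u, q_i, q_j):
    # p_u: [batch, dim] user embeddings
    # q_i: [batch, dim] embeddings of observed (positive) items
    # q_j: [batch, dim] embeddings of sampled negative items
    x_uij = (p_u * q_i).sum(-1) - (p_u * q_j).sum(-1)   # \hat{x}_{uij}
    return -F.logsigmoid(x_uij).mean()                  # minimizing this maximizes ln sigma(\hat{x}_{uij})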
Problem with Discriminative Ranking
• Huge numbers of users and items
• Amazon: 300 million customers, 350 million products*
• YouTube: 2.6+ billion monthly active users, 5+ billion videos**
• We have to use multi-stage filtering: Simple rules are used at early stages,
advanced algorithms are only applied to a small number of items at later stages

• Too many candidate items, difficult for evaluation


• Many research papers use sampled evaluation: 1-in-100, 1-in-1000, etc.
15
*https://sell.amazon.com/blog/amazon-stats and https://www.bigcommerce.com/blog/amazon-statistics/
**https://www.globalmediainsight.com/blog/youtube-users-statistics/
Large Language Models (LLMs)
• Auto-regressive decoding for generative prediction

Image credit to [1] Image credit to [2]

16
[1] Sanh, Victor, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin et al. "Multitask prompted training enables zero-shot task generalization." ICLR 2022.
[2] Yang, Jingfeng, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. "Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond." arXiv preprint arXiv:2304.13712 (2023).
Generative Pre-training and Prediction
• Generative Pre-training
• Generative Loss Function
• Use the previous tokens to predict the next token

• Generative Prediction
• Beam Search
Image credit to [1]
• Using finite tokens to represent an (almost) infinite number of items
• e.g., 100 vocabulary tokens, ID length 10 => #items = 100^10 = 10^20
Image credit to [2]

• The number of candidate tokens at each beam step is bounded
• No longer needs one-by-one candidate score calculation as in discriminative ranking
• Directly generates the item ID to recommend


17
[1] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[2] https://d2l.ai/chapter_recurrent-modern/beam-search.html
Generative Ranking
• From Multi-stage ranking to Single-stage ranking
• The model automatically considers all items as the candidate pool
• Fixed-size item decoding
• e.g., using 100 tokens ⟨00⟩⟨01⟩…⟨99⟩ for item ID representation
Encoder input: "Given the interaction history of user_235: item_5678, item_8265, item_521, item_2235, item_750, what to recommend next for the user?"
Decoder input: ⟨s⟩ ⟨23⟩ ⟨68⟩    Decoder output: ⟨23⟩ ⟨68⟩ ⟨/s⟩
Answer: item_2368

18
Generative Recommendation with Beam Search
● Since item IDs are tokenized (e.g., ["item", "_", "73", "91"]), beam search is bounded in width
● E.g., a width of 100 tokens: ⟨00⟩, ⟨01⟩, ⟨02⟩, …, ⟨98⟩, ⟨99⟩
● Assigning each item its own unique token, as in traditional recommendation, is infeasible for LLMs
● It would consume a lot of memory and be computationally expensive

19
[1] Li, Lei, Yongfeng Zhang, Dugang Liu, and Li Chen. "Large Language Models for Generative Recommendation: A Survey and Visionary Discussions." arXiv preprint arXiv:2309.01157 (2023).
Large Language Models for
Recommendation

20
How to Categorize LLM-based Recommendation
• Whether to Fine-tune LLM for Recommendation or Not
• With Fine-tuning [1]
• Without Fine-tuning [2]
• The Role of LLM in Recommendation
• LLM as RecSys [1]
• LLM in RecSys [3]
• e.g., LLM as a feature extractor for recommender systems
• RecSys in LLM [4]
• e.g., LLM-based Agents, where RecSys is used as one of the tools
• Typical Recommendation Tasks [1]
• Rating Prediction, Sequential Recommendation, Direct Recommendation, ...
[1] Geng, Shijie, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. "Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)" RecSys 2022.
[2] Liu, Junling, Chao Liu, Renjie Lv, Kang Zhou, and Yan Zhang. "Is chatgpt a good recommender? a preliminary study." arXiv preprint arXiv:2304.10149 (2023).
[3] Lin, Jianghao, et al. "How Can Recommender Systems Benefit from Large Language Models: A Survey." arXiv preprint arXiv:2306.05817 (2023).
[4] Wang, Yancheng, Ziyan Jiang, Zheng Chen, Fan Yang, Yingxue Zhou, Eunah Cho, Xing Fan, Xiaojiang Huang, Yanbin Lu, and Yingzhen Yang. "RecMind: Large Language Model Powered Agent for Recommendation." arXiv preprint arXiv:2308.14296 (2023).
21
Two Broad Categories of Recommendation Tasks
Prediction Tasks Generation Tasks

Image credit to [1]

22
[1] Fan, Wenqi, et al. "Recommender systems in the era of large language models (llms)." arXiv preprint arXiv:2307.02046 (2023).
Typical Recommendation Tasks
• LLM usually can perform multiple recommendation tasks
• e.g., P5 [2], POD [3], InstructRec [4]

Image credit to [1]

[1] Li, Lei, Yongfeng Zhang, Dugang Liu, and Li Chen. "Large Language Models for Generative Recommendation: A Survey and Visionary Discussions." arXiv preprint arXiv:2309.01157 (2023).
[2] Geng, Shijie, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. "Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)." RecSys 2022.
[3] Li, Lei, Yongfeng Zhang, and Li Chen. "Prompt Distillation for Efficient LLM-based Recommendation." CIKM 2023.
[4] Zhang, Junjie, et al. "Recommendation as instruction following: A large language model empowered recommendation approach." arXiv preprint arXiv:2305.07001 (2023).
23
The P5 Generative Recommendation Paradigm
• P5: Pretrain, Personalized Prompt & Predict Paradigm [1]

● Learns multiple recommendation tasks together through a unified sequence-to-sequence framework
● Formulates different recommendation problems as prompt-based natural language tasks
● User-item information and corresponding features are integrated with personalized prompts as model inputs

24
[1] Geng, Shijie, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. "Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)." RecSys 2022.
Personalization in Prompts
● Definition of personalized prompts
○ A prompt that includes personalized fields for different users and items

● User’s preference can be indicated through


○ A user ID (e.g., “user_23”)
○ Content description of the user such as location, preferred movie genres, etc.

● Item field can be represented by


○ An item ID (e.g., “item_7391”)

○ Item content metadata that contains detailed descriptions of the item, e.g., item category

25
Personalized Prompt Design

26
Design Multiple Prompts for Each Task
• To enhance variation in language style (e.g., sequential recommendation)

27
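To make the idea of multiple personalized prompts per task concrete, here is a small sketch of paraphrased sequential-recommendation templates with personalized fields; the wording is illustrative and not the exact P5 prompt collection:

# Several paraphrased templates for the same task; training samples among them to vary language style.
SEQUENTIAL_TEMPLATES = [
    "Given the purchase history of {user_id}: {history}, predict the next possible item for the user.",
    "Here is the interaction history of {user_id}: {history}. What should we recommend next?",
    "Based on {history}, which item would {user_id} most likely interact with next?",
]

def render_prompts(user_id, history):
    history_str = ", ".join(history)
    return [t.format(user_id=user_id, history=history_str) for t in SEQUENTIAL_TEMPLATES]

# Example usage
prompts = render_prompts("user_23", ["item_7391", "item_1004", "item_1005"])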
Multi-Task Pre-training

28
Multi-Task Pre-training

● P5 is pre-trained on top of T5 checkpoints (to enable basic ability for language understanding)

● By default, P5 uses multiple sub-word units to represent personalized fields (e.g., [“item”, “_”, “73”, “91”])

29
Generative Recommendation with Beam Search
● The encoder takes the input sequence
● The decoder autoregressively generates the next tokens
○ The autoregressive LM loss is shared by all tasks: $\mathcal{L} = -\sum_{j=1}^{|y|} \log P_\theta(y_j \mid y_{<j}, x)$
● P5 can unify various recommendation tasks with one model, one loss, and one data format
● Inference with pretrained P5
○ Simply apply beam search to generate a list of potential next items (Image credit to [1])
○ Beam size set to N (N candidates)

30
[1] https://d2l.ai/chapter_recurrent-modern/beam-search.html
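As a concrete illustration of the inference step, here is a minimal sketch of beam-search decoding with the Hugging Face transformers API; it assumes a T5-style checkpoint fine-tuned in the P5 style, and the checkpoint name and prompt are placeholders:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")               # placeholder backbone
model = T5ForConditionalGeneration.from_pretrained("t5-small")    # substitute a fine-tuned model

prompt = ("Given the interaction history of user_235: item_5678, item_8265, "
          "item_521, item_2235, item_750, what to recommend next for the user?")
inputs = tokenizer(prompt, return_tensors="pt")

N = 10  # beam size = number of returned candidates
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=8, num_beams=N,
                             num_return_sequences=N, early_stopping=True)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)  # ranked top-N item IDs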
Generative Recommendation with Beam Search
● Since item IDs are tokenized (e.g., ["item", "_", "73", "91"]), beam search is bounded in width
● E.g., a width of 100 tokens: ⟨00⟩, ⟨01⟩, ⟨02⟩, …, ⟨98⟩, ⟨99⟩
● Assigning each item its own unique token, as in traditional recommendation, is infeasible for LLMs
● It would consume a lot of memory and be computationally expensive

31
[1] Li, Lei, Yongfeng Zhang, Dugang Liu, and Li Chen. "Large Language Models for Generative Recommendation: A Survey and Visionary Discussions." arXiv preprint arXiv:2309.01157 (2023).
Advantages of P5 Generative Recommendation
• Immerses recommendation models into a full language environment
• With the flexibility and expressiveness of language, there is no need to design feature-specific encoders
• P5 treats all personalized tasks as a conditional text generation problem
• One data format, one model, one loss for multiple recommendation tasks
• No need to design data-specific or task-specific recommendation models
• P5 attains sufficient zero-shot performance when generalizing to novel personalized prompts or unseen items in other domains

32
Performance of P5 under seen Prompts
Rating Prediction: Sequential Recommendation:

Explanation Generation:

33
Performance of P5 under seen Prompts
Review Summarization:

Direct Recommendation:

Observation: P5 achieves promising performance on the five task families when taking seen prompt templates as model inputs
34
Performance of P5 under unseen Prompts
Observation: Multitask prompted pretraining endows P5 with good robustness to understand unseen prompts with wording variations
Sequential Recommendation: Explanation Generation:

Direct Recommendation:

35
Easy Handling of Multi-modality Information
• Item images can be directly inserted into personalized prompts [1]

36
[1] Geng, Shijie, Juntao Tan, Shuchang Liu, Zuohui Fu, and Yongfeng Zhang. "VIP5: Towards Multimodal Foundation Models for Recommendation." EMNLP 2023.
Easy Handling of Multi-modality Information
• Item images can be converted into visual tokens

37
[1] Geng, Shijie, Juntao Tan, Shuchang Liu, Zuohui Fu, and Yongfeng Zhang. "VIP5: Towards Multimodal Foundation Models for Recommendation." EMNLP 2023.
Easy Handling of Multi-modality Information
• Item images can be directly inserted into prompts
• Multi-modality information further improves performance

Sequential Recommendation Performance Direct Recommendation Performance

38
ChatGPT as Recommender
• Instruct ChatGPT to perform different tasks w/o fine-tuning
• Few-shot or zero-shot settings (w/ or w/o demonstration examples)

39
[1] Liu, Junling, et al. "Is chatgpt a good recommender? a preliminary study." arXiv preprint arXiv:2304.10149 (2023).
ChatGPT on Recommendation Tasks
• Recommendation performance is relatively weak
Direct Recommendation
Sequential Recommendation

Rating Prediction

40
[1] Liu, Junling, et al. "Is chatgpt a good recommender? a preliminary study." arXiv preprint arXiv:2304.10149 (2023).
ChatGPT on Generation Tasks
• Performance with automatic metrics is bad
• Rated highly by human evaluators
• Existing metrics (BLEU and ROUGE) overly stress the matching between generation
and ground-truth [2]
Explanation Generation Review Summarization

41
[1] Liu, Junling, et al. "Is chatgpt a good recommender? a preliminary study." arXiv preprint arXiv:2304.10149 (2023).
[2] Wang, Xiaolei, et al. "Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models." arXiv preprint arXiv:2305.13112 (2023).
ChatGPT as Recommender
• ChatGPT on three types of recommendation w/o fine-tuning
• Point-wise (rate), pair-wise (compare), list-wise (rank)

42
[1] Dai, Sunhao, et al. "Uncovering ChatGPT's Capabilities in Recommender Systems." RecSys 2023.
Recommendation Performance of ChatGPT
• Outperform weak baselines on the three recommendation tasks
• e.g., random and popularity-based (pop) baselines

43
[1] Dai, Sunhao, et al. "Uncovering ChatGPT's Capabilities in Recommender Systems." RecSys 2023.
With Fine-tuning or Without Fine-tuning
• Without fine-tuning, LLMs cannot easily solve RS problems
• RS is a highly specialized area that requires collaborative knowledge, which LLMs did not learn during the pre-training stage [1]
• Collaborative knowledge such as user behavior data is highly dynamic
• RS practitioners do not face the same existential crisis as the NLP community
• Many NLP problems can be readily addressed by LLMs
• RS is still an open problem and will evolve together with LLMs

44
[1] Lin, Jianghao, et al. "How Can Recommender Systems Benefit from Large Language Models: A Survey." arXiv preprint arXiv:2306.05817 (2023).
Role of LLM in Recommendation
• LLM as RS
• E.g., P5 and ChatGPT-based recommenders
• LLM in RS as a component

Image credit to [1]


Image credit to [2]

45
[1] Wu, Likang, et al. "A Survey on Large Language Models for Recommendation." arXiv preprint arXiv:2305.19860 (2023).
[2] Lin, Jianghao, et al. "How Can Recommender Systems Benefit from Large Language Models: A Survey." arXiv preprint arXiv:2306.05817 (2023).
LLM as Feature Encoder
• LLM is grounded to the recommendation space by generating tokens for items
• Then these tokens are grounded to actual items in the actual item space
Image credit to [2]

Image credit to [1]


46
[1] Bao, Keqin, et al. "A bi-step grounding paradigm for large language models in recommendation systems." arXiv preprint arXiv:2308.08434 (2023).
[2] Wu, Likang, et al. "A Survey on Large Language Models for Recommendation." arXiv preprint arXiv:2305.19860 (2023).
LLM as Feature Encoder
• Instruct LLM to generate search queries
• Then a search algorithm is applied to retrieve items based on the queries

47
[1] Li, Jinming, et al. "GPT4Rec: A generative framework for personalized recommendation and user interests interpretation." arXiv preprint arXiv:2304.03879 (2023).
LLM as Scoring Function
• Instruct LLM to generate a binary score (like or dislike) for each item
• Discriminative, like traditional recommenders

48
[1] Bao, Keqin, et al. "Tallrec: An effective and efficient tuning framework to align large language model with recommendation." arXiv preprint arXiv:2305.00447 (2023).
LLM as Ranking Function
• Provide LLM with candidates from another RS for re-ranking

Chain of thought
1. Preference inference
2. Preferred item selection
3. Recommendation

Image credit to NIR [1] Image credit to PALR [2]


49
[1] Wang, Lei, and Ee-Peng Lim. "Zero-Shot Next-Item Recommendation using Large Pretrained Language Models." arXiv preprint arXiv:2304.03153 (2023).
[2] Chen, Zheng. "PALR: Personalization Aware LLMs for Recommendation." Gen-IR@SIGIR 2023: The First Workshop on Generative Information Retrieval (2023).
LLM as Ranking Function
• LLM takes candidates from a Recall model for re-ranking
• Design prompts for different recommendation settings

50
[1] Zhang, Junjie, et al. "Recommendation as instruction following: A large language model empowered recommendation approach." arXiv preprint arXiv:2305.07001 (2023).
LLM as Pipeline Controller
• Break each task into several planning steps
• Thought, action and observation
• Control personalized memory and world knowledge
• Perform specific tasks with tools, e.g., task-specific models

51
[1] Wang, Yancheng, et al. "RecMind: Large Language Model Powered Agent For Recommendation." arXiv preprint arXiv:2308.14296 (2023).
Recommendation Tasks
• Rating Prediction
• Sequential Recommendation
• Top-N Recommendation
• Explanation Generation
• Review Summarization
• Review Generation
• Conversational Recommendation

52
Conversational Recommendation
• LLM as the whole conversational recommender
• T: Task description
• F: Format requirement
• S: Conversational context

Image credit to [1]

53
[1] He, Zhankui, et al. "Large Language Models as Zero-Shot Conversational Recommenders." CIKM 2023.
[2] Cui, Zeyu, et al. "M6-rec: Generative pretrained language models are open-ended recommender systems." arXiv preprint arXiv:2205.08084 (2022).
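A minimal sketch of how a zero-shot conversational-recommendation prompt could be assembled from the T/F/S components above; the wording below is illustrative and not the exact template of [1]:

def build_conversational_prompt(task_desc, format_req, conversation):
    # T: task description, F: format requirement, S: conversational context
    dialogue = "\n".join(f"{speaker}: {utterance}" for speaker, utterance in conversation)
    return f"{task_desc}\n{format_req}\n\nConversation:\n{dialogue}\nRecommender:"

# Example usage
prompt = build_conversational_prompt(
    task_desc="You are a movie recommender chatting with a user.",
    format_req="Reply with a ranked list of 5 movie titles only.",
    conversation=[("User", "I loved Inception and Interstellar."),
                  ("Recommender", "Do you prefer hard sci-fi or thrillers?"),
                  ("User", "Mostly sci-fi with big ideas.")],
)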
Conversational Recommendation
• LLM as dialogue manager that merges various types of info
• Recommendations (from another model)
• Dialogue history

54
[1] Gao, Yunfan, et al. "Chat-rec: Towards interactive and explainable llms-augmented recommender system." arXiv preprint arXiv:2303.14524 (2023).
Conversational Recommendation
• Multiple LLMs play separate roles
• Dialogue Manager
• Ranking Function
• User Simulator

55
[1] Friedman, Luke, et al. "Leveraging Large Language Models in Conversational Recommender Systems." arXiv preprint arXiv:2305.07961 (2023).
Evaluation Protocols
• Recommendation
• RMSE and MAE for rating prediction
• NDCG, Precision and Recall for top-N and sequential recommendation
• Online A/B test
• Generation
• BLEU and ROUGE for text similarity
• Overly stress the matching between generation and ground-truth [1]
• Advanced metrics are needed
• Human Evaluation

56
[1] Wang, Xiaolei, et al. "Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models." arXiv preprint arXiv:2305.13112 (2023).
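To make the ranking metrics above concrete, here is a minimal sketch of Recall@K and NDCG@K for the common setting of one ground-truth item per user; real evaluation code additionally averages over users and may handle multiple relevant items:

import math

def recall_at_k(ranked_items, ground_truth, k):
    # 1 if the ground-truth item appears in the top-k list, else 0
    return int(ground_truth in ranked_items[:k])

def ndcg_at_k(ranked_items, ground_truth, k):
    # With a single relevant item, the ideal DCG is 1 (a hit at rank 1)
    if ground_truth in ranked_items[:k]:
        rank = ranked_items.index(ground_truth)      # 0-based position
        return 1.0 / math.log2(rank + 2)
    return 0.0

# Example: ground truth ranked 3rd in the returned list
print(recall_at_k(["item_12", "item_7", "item_42"], "item_42", 5))  # 1
print(ndcg_at_k(["item_12", "item_7", "item_42"], "item_42", 5))    # 0.5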
Trustworthy LLMs for
Recommendation

57
Trustworthy LLM4RS
• Hallucination (item ID indexing)
• Fairness
• Transparency
• Robustness
• Controllability
• etc.

58
(Figure: dimensions of trustworthy LLM4RS: Hallucination, Fairness, Transparency, Robustness, Controllability)

59
Hallucination: Item Generation
• LLM-based Generative Recommendation Paradigm
• We want to directly generate the recommended item
• Avoid one-by-one ranking score calculation

• However, item descriptions can be very long


• e.g., product description: >100 words
• e.g., news article: >1,000 words

60
Hallucination: Item Generation
• Generating long text is difficult, especially for recommendation
• Hallucination problem
• Generated text may not correspond to a real existing item in the database
• Calculating similarity between generated text and item text?
• Goes back to one-by-one similarity calculation for ranking!

• Item ID: A short sequence of tokens for an item


• Easy generation, and can be indexed!

• Item ID can take various forms


• A sequence of numerical tokens <73><91><26>
• A sequence of word tokens <the><lord><of><the><rings>

61
Why can Item IDs eliminate hallucination?

With item indices consisting of a limited vocabulary and known structure, we can constrain the beam search to the allowed tokens at every generation step. Thus, hallucination will be eliminated.

62
picture credited to: Li, Lei, et al. "Large Language Models for Generative Recommendation: A Survey and Visionary Discussions." arXiv preprint arXiv:2309.01157 (2023).
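A minimal sketch of how this constrained beam search can be realized with the prefix_allowed_tokens_fn hook of Hugging Face generate(), using a prefix trie built over the valid item-ID token sequences; the checkpoint, item IDs, and prompt are placeholders, and the snippet is illustrative rather than the implementation behind the figure:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")             # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Token-ID sequences of all valid item IDs (assumed to exist in the vocabulary).
valid_items = ["item_2368", "item_7391", "item_1004"]
sequences = [tokenizer(i, add_special_tokens=False).input_ids for i in valid_items]

# Prefix trie: maps an already-generated prefix to the set of allowed next tokens.
trie = {}
for seq in sequences:
    for t in range(len(seq)):
        trie.setdefault(tuple(seq[:t]), set()).add(seq[t])

def prefix_allowed_tokens_fn(batch_id, input_ids):
    prefix = tuple(input_ids.tolist()[1:])                       # drop the decoder start token
    return list(trie.get(prefix, {tokenizer.eos_token_id}))      # finish once an ID is complete

inputs = tokenizer("What to recommend next for user_235?", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=3, num_return_sequences=3,
                         prefix_allowed_tokens_fn=prefix_allowed_tokens_fn)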
How to Index Items?
• Item ID: item needs to be represented as a sequence of tokens
• e.g., an item represented by two tokens <73> <91>

• Different item indexing gives very different performance

63
How to Index Items (create Item IDs)
• Three properties for good item indexing methods
• Items are distinguishable (different items have different IDs)
• Similar items have similar IDs (more shared tokens in their IDs)
• Dissimilar items have dissimilar IDs (fewer shared tokens in their IDs)
• Three naïve Indexing methods
• Random ID (RID): Item ⟨73⟩⟨91⟩, item ⟨73⟩⟨12⟩, …
• Title as ID (TID): Item ⟨the⟩⟨lord⟩⟨of⟩⟨the⟩⟨rings⟩, …
• Independent ID (IID): Item ⟨1364⟩, Item ⟨6321⟩, …

64
How to Index Items (create Item IDs)
• Three naïve Indexing methods
• Random ID (RID): Item <73><91>, item <73><29>, …
• Very different items may share the same tokens
• Mistakenly making them semantically similar

• Title as ID (TID): Item <the><lord><of><the><rings>


• Very different movies may share similar titles
• Inside Out (animation) and Inside Job (documentary)
• The Lord of the Rings (epic fantasy) and The Lord of War (crime drama)

• Independent ID (IID): Item <1364>, Item <6321>, …


• Too many out-of-vocabulary (OOV) new tokens need to be learned
• Computationally unscalable

65
Meticulous Item Indexing Methods are Needed

LLM4RS

66
Sequential Indexing (SID)
• Leverage the local co-appearance information between items

• After tokenization, co-appearing items share similar tokens


• Item 1004: <100><4>
• Item 1005: <100><5>

67
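A minimal sketch of sequential indexing: items are numbered consecutively in the order they first appear while walking users' interaction sequences, so locally co-appearing items get numerically close IDs that share leading tokens after tokenization (illustrative code, not the exact procedure of the indexing paper):

def sequential_index(user_sequences, start=1000):
    # Assign consecutive integer IDs in first-appearance order over all user sequences.
    item_to_id, next_id = {}, start
    for seq in user_sequences:
        for item in seq:
            if item not in item_to_id:
                item_to_id[item] = next_id
                next_id += 1
    return item_to_id

# Example: co-appearing items get adjacent IDs, e.g., 1004 -> <100><4>, 1005 -> <100><5>
ids = sequential_index([["shoes", "socks", "laces"], ["socks", "hat"]])
# {'shoes': 1000, 'socks': 1001, 'laces': 1002, 'hat': 1003}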
Collaborative Indexing (CID)
• Leverage the global co-appearance information between items
• Spectral Matrix Factorization over the item-item co-appearance matrix
• Hierarchical Spectral Clustering

68
Collaborative Indexing (CID)
• Leverage the global co-appearance information between items
• Root-to-Leaf Path-based Indexing
• Items in the same cluster share more tokens

69
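A rough sketch of collaborative indexing via recursive spectral clustering of the item-item co-appearance matrix, where each item's ID is its root-to-leaf cluster path; this is a simplified illustration, and the actual method differs in details such as the clustering and disambiguation steps:

import numpy as np
from sklearn.cluster import SpectralClustering

def collaborative_index(cooccur, items, n_clusters=2, max_leaf=2, prefix=()):
    # items: list of item indices into the co-appearance matrix; ID = tuple of path tokens
    if len(items) <= max_leaf:
        return {item: prefix + (str(i),) for i, item in enumerate(items)}   # leaf disambiguation
    sub = cooccur[np.ix_(items, items)] + 1e-6                              # affinity submatrix
    labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed").fit_predict(sub)
    if len(set(labels)) < 2:                                                # degenerate split: force halves
        labels = np.array([i % n_clusters for i in range(len(items))])
    ids = {}
    for c in range(n_clusters):
        members = [items[i] for i in range(len(items)) if labels[i] == c]
        ids.update(collaborative_index(cooccur, members, n_clusters, max_leaf, prefix + (str(c),)))
    return ids

# Example usage with a toy symmetric co-appearance matrix for 6 items
co = np.random.rand(6, 6); co = co + co.T
item_ids = collaborative_index(co, list(range(6)))   # e.g., item 0 -> ('0', '1', '0') -> <0><1><0>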
Semantic (Content-based) Indexing (SemID)
• Leverage the item content information for item indexing
• e.g., the multi-level item category information in Amazon

70
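A minimal sketch of semantic (content-based) indexing from a multi-level category path, with a trailing counter token to keep items in the same leaf category distinguishable (as in the ⟨Makeup⟩⟨Lips⟩⟨Lip_Liners⟩⟨5⟩ example on the next slide); the code is illustrative:

from collections import defaultdict

def semantic_index(item_categories):
    # item_categories: {item: [level-1 category, level-2 category, ...]}
    counters, ids = defaultdict(int), {}
    for item, path in item_categories.items():
        counters[tuple(path)] += 1
        ids[item] = list(path) + [str(counters[tuple(path)])]   # path tokens + disambiguation token
    return ids

# Example usage
ids = semantic_index({"lip_liner_a": ["Makeup", "Lips", "Lip_Liners"],
                      "lip_liner_b": ["Makeup", "Lips", "Lip_Liners"]})
# {'lip_liner_a': ['Makeup', 'Lips', 'Lip_Liners', '1'], 'lip_liner_b': ['Makeup', 'Lips', 'Lip_Liners', '2']}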
Hybrid Indexing (HID)
• Concatenate more than one of the following indices
• Random ID (RID)
• Title as ID (TID)
• Independent ID (IID)
• Sequential ID (SID)
• Collaborative ID (CID)
• Semantic ID (SemID)

• For example, if an item’s Semantic ID and Collaborative ID are as follows:


• SemID: ⟨Makeup⟩⟨Lips⟩⟨Lip_Liners⟩⟨5⟩
• CID: ⟨1⟩⟨9⟩⟨5⟩⟨4⟩
• Then its Hybrid ID is ⟨Makeup⟩⟨Lips⟩⟨Lip_Liners⟩⟨1⟩⟨9⟩⟨5⟩⟨4⟩

71
Different Item Indexing Gives Different Performance

(Results table comparing naïve indexing methods, advanced indexing methods, and hybrid indexing methods)

• Advanced indexing methods are better than naïve methods


• Some hybrid indexing can further improve performance
72
Fairness of LLM for Recommendation

1. Fairness of general LLMs on critical domains (education, criminology, finance and healthcare) [1]
2. User-side fairness: UP5 [2], FaiRLLM benchmark [3]
3. Item-side fairness: popularity bias [4]

[1] Li, Yunqi, et al. "Fairness of ChatGPT." arXiv preprint arXiv:2305.18569 (2023).

[2] Hua, Wenyue, et al. "UP5: Unbiased Foundation Model for Fairness-aware Recommendation." arXiv preprint arXiv:2305.12090 (2023).

[3] Zhang, Jizhi, et al. "Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation." arXiv preprint arXiv:2305.07609 (2023).

[4] Hou, Yupeng, et al. "Large language models are zero-shot rankers for recommender systems." arXiv preprint arXiv:2305.08845 (2023).
Fairness of General LLM
• Fairness of ChatGPT on four critical domains [1]
• Education, Criminology, Finance and Healthcare
• Four Datasets
• PISA (education), COMPAS (criminology)
• German Credit (finance), Heart Disease (healthcare)
• Five Fairness Evaluation Dimensions
• Statistical Parity
• Equal Opportunity
• Equalized Odds
• Overall Accuracy Equality
• Counterfactual Fairness
• Main Observation
• ChatGPT is fairer than small models such as regression and MLP classifiers, though ChatGPT still has unfairness issues
74
[1] Li, Yunqi, et al. "Fairness of ChatGPT." arXiv preprint arXiv:2305.18569 (2023).
User-side Fairness method
Users want to be treated fairly, independent of their sensitive user features.

Are pretrained LLM4RS models fair when recommending items?

75
[1] Li, Yunqi, et al. "Towards personalized fairness based on causal notion." Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2021.
How to make sure recommendations are fair?
If the input representation is independent of the user's sensitive features, then the generated recommendations are independent of those sensitive features.

The AUC scores on various user features show that the sensitive features are incorporated in the input representations, leading to unfair recommendations.
76
Fairness Prompts for LLM

For each feature k, the adversarial loss is:

77
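The adversarial loss itself appears as an image in the original slide. As a rough sketch of one common formulation (a discriminator tries to predict sensitive feature k from the prompt-conditioned user representation, and the fairness prompt is trained so that it fails); this is illustrative and not necessarily the exact loss used in UP5 [2]:

import torch.nn as nn

class FeatureDiscriminator(nn.Module):
    # Predicts sensitive feature k (e.g., a gender class) from the user representation.
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.clf = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, num_classes))
    def forward(self, user_repr):
        return self.clf(user_repr)

ce = nn.CrossEntropyLoss()

def adversarial_losses(discriminator, user_repr, feature_labels, rec_loss, lam=1.0):
    # In practice, alternate: update the discriminator on d_loss (with user_repr detached),
    # then update the fairness prompt on prompt_loss = rec_loss - lam * d_loss.
    d_loss = ce(discriminator(user_repr), feature_labels)
    prompt_loss = rec_loss - lam * d_loss
    return d_loss, prompt_loss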
Single-feature fairness results

78
Fairness on multiple features
Users may require recommendation fairness on multiple features.
Do we retrain a fairness prompt on each feature combination?

79
Prompt Mixture

Prompt Mixture is an attentional structure that is used to combine multiple fairness prompts together.

80
Fairness on multiple features

81
User-side Fairness Benchmark: FaiRLLM

82
Unfairness of ChatGPT for recommendation

(Figure) X-axis: number of recommended items. Y-axis: similarity score between the recommendations under sensitive-attribute instructions and under the neutral instruction.
Conclusion: ChatGPT is not user-side fair.
83
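One similarity measure used in this style of fairness benchmark is the Jaccard overlap at K between the neutral-instruction recommendation list and the list produced when a sensitive attribute is added to the instruction; lower similarity indicates larger unfairness. A minimal sketch (FaiRLLM also uses other similarity measures):

def jaccard_at_k(neutral_recs, sensitive_recs, k):
    # Overlap between the top-k lists from the neutral and the sensitive-attribute prompt.
    a, b = set(neutral_recs[:k]), set(sensitive_recs[:k])
    return len(a & b) / len(a | b)

# Example: "recommend me some songs" vs. "I am a <sensitive attribute> fan, recommend me some songs"
neutral = ["song_1", "song_2", "song_3", "song_4", "song_5"]
with_attr = ["song_1", "song_9", "song_3", "song_8", "song_7"]
print(jaccard_at_k(neutral, with_attr, 5))   # 2 shared items of 8 distinct -> 0.25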
Item-side Fairness on LLM4RS: popularity bias

(Figure) X-axis: position in the ranked item lists. Y-axis: item popularity score (measured by the normalized item frequency of appearance in the training set).

Conclusion: Popular items tend to be ranked at higher positions.

84
Item-side Fairness on LLM4RS: popularity bias

(Figure) X-axis: the number of historical interactions in the prompt (decreasing). Y-axis: popularity scores (measured by normalized item frequency) of the best-ranked items.

Conclusion: as the number of interactions in the prompt decreases, the popularity score of the best-ranked items decreases as well.

85
Trustworthy LLM4RS
• Hallucination (item ID indexing)
• Fairness
• Transparency
• Robustness
• Controllability
• etc.

86
Transparency
Main idea: Given a GPT-2 neuron, leverage GPT-4 to generate an explanation of its
behavior by showing relevant text sequences and activations

87
[1] Bills, Steven, et al. "Language models can explain neurons in language models." URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html (Date accessed: 14.05.2023) (2023).
Robustness
Robustness evaluation of different foundation models

The results show that ChatGPT has a consistent advantage on adversarial and OOD tasks. However, its absolute performance is far from perfect, indicating much room for improvement.
88
[1] Wang, Jindong, et al. "On the robustness of chatgpt: An adversarial and out-of-distribution perspective." arXiv preprint arXiv:2302.12095 (2023).
Controllability
Controllable text generation: the user can specify the style, content, or specific attributes to include in the generated text.

89
[1] Zhang, Hanqing, et al. "A survey of controllable text generation using transformer-based pre-trained language models." ACM Computing Surveys (2022).
A Hands-on Demo of LLM-RecSys
Development based on OpenP5

90
OpenP5
• An open-source platform for LLM-based recommendation development, fine-tuning, and evaluation
• OpenP5 is a general framework for LLM-based recommendation model development based on the P5 paradigm [1]
• Supports different backbone LLMs, such as T5 and LLaMA
• GitHub link: https://github.com/agiresearch/OpenP5/tree/main

91
[1] Geng, Shijie, et al. "Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)." RecSys 2022.
OpenP5

92
[1] Geng, Shijie, et al. "Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)." RecSys 2022.
OpenP5
• Popular datasets: 10 popular datasets from Amazon, Yelp, and MovieLens

• Item indexing [1]: Random, Sequential, Collaborative

• Downstream tasks: Sequential, Straightforward

• Backbone LLMs: T5, LLaMA

• Training acceleration: Distributed Learning, LoRA

93
[1] Wenyue Hua, Shuyuan Xu, Yingqiang Ge, Yongfeng Zhang. "How to Index Item IDs for Recommendation Foundation Models." In Proceedings of SIGIR-AP 2023
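To illustrate the LoRA-based training acceleration listed above, here is a minimal sketch of attaching LoRA adapters to a T5 backbone with the peft library; this is generic Hugging Face/peft usage with placeholder hyperparameters, not OpenP5's actual training script:

from transformers import T5ForConditionalGeneration
from peft import LoraConfig, get_peft_model, TaskType

base = T5ForConditionalGeneration.from_pretrained("t5-small")   # placeholder backbone

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,   # encoder-decoder LM
    r=8,                               # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],         # T5 attention projections
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()     # only the small adapter weights are trainable
# ...then fine-tune as usual on P5-style prompt/target pairs...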
A Hands-on Demo

94
Custom LLM-based Recommendation
• Apply new data: only user-item interactions are required

• Apply new prompt templates: add your own prompt files

• Apply new backbone LLMs: import other pre-trained backbone models from transformers

95
Summary and Future Vision

96
The Future of Generative Recommendation
• Recommendation as Personalized Generative AI
• Generate personalized contents for users based on prompts
• Prompt: "I am traveling in Singapore, generate some images for me to post on Instagram"
• Personalized generation of candidate images for users to consider

97
*Image generated with The New Bing
The Future of Generative Recommendation
• Recommendation as Personalized Generative Advertisement
• Personalized Advertisement Generation
• Same ad, different wording, real-time generation given user’s context
• e.g., an environmental protection ad for an NGO
For Children: Join us in protecting our planet. Let's work together to make the world a better place for ourselves and for future generations.
For Business Leaders: Join the movement towards sustainability and create a brighter future for your business and our planet. By adopting environmentally-friendly practices, you can reduce your costs, attract new customers, and enhance your reputation as a responsible business leader.
98
*Text generated with ChatGPT
Summary
• Large Language Models for Recommendation: key takeaways
• From Discriminative Recommendation to Generative Recommendation
• From Multi-stage Ranking to Single-stage Ranking
• From Single-task learning to Multi-task learning
• From Single-modality modeling to Multi-modality modeling
• Key Topics
• Large Language Model based Recommendation Models and Evaluation
• Trustworthy Large Language Model for Recommendation
• Hands-on tutorial of LLM-based recommendation model development

99
TORS Special Issue Call for Papers
• Topic: Large Language Models for Recommender Systems

• Submission deadline: December 15, 2023


• First-round review decisions: March 15, 2024
• Deadline for revision submissions: May 15, 2024
• Notification of final decisions: July 15, 2024

100
101
