How to Evaluate the Memory of an AI Assistant? (editing)
Last Updated: 2025-03-29
0. Overview
The research is about the long-term memory ability of AI chat assistants. Its contributions can be divided into two parts:
- Guidance on how to build a dataset and use it to evaluate the long-term memory of an AI assistant
- A suggested design for long-term memory, with experiments on it
Before getting started, let's check which abilities of long-term memory the research evaluates:
- Information Extraction
- Multi-session Reasoning
- Knowledge Updates
- Temporal Reasoning
- Abstention
Index
- Structure of Evaluation Dataset
- Evaluation Methods
- Dataset building pipeline
- Long-term memory design suggestions
- Experiments on long-term memory
1. Structure of Evaluation Dataset
Each item of the evaluation dataset, i.e., a problem, is structured as below:
$$
\text{problem} = (\textbf{S}, q, t_q, a) \\
\textbf{S} = [(t_1, S_1), (t_2, S_2), \dots, (t_N, S_N)]
$$
- $S_i$: a multi-turn interaction session between the user and the AI assistant, with timestamp $t_i$
- $q$: question
- $a$: answer
- $t_q$: timestamp of QA session ( $t_q > t_N$ )
The assistant is expected to derive the answer $a$ from the question $q$ and the user's interaction history $\textbf{S}$.
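As a concrete picture of this structure, here is a minimal sketch in Python. The class and field names are assumptions for illustration, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Session:
    # One (t_i, S_i) pair: a timestamped multi-turn session.
    timestamp: str        # t_i
    turns: list[dict]     # e.g. [{"role": "user", "content": "..."}, ...]

@dataclass
class Problem:
    # problem = (S, q, t_q, a)
    history: list[Session]  # S = [(t_1, S_1), ..., (t_N, S_N)]
    question: str           # q
    question_time: str      # t_q, later than every t_i in the history
    answer: str             # a, the gold answer
```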
2. Evaluation Methods
Two main methods:
- Question-Answering
- Memory Recall
a. Question Answering
Main idea: "Is the AI assistant's answer good for the given question?"
Target of the evaluation:
- The AI assistant, equipped with the interaction history (memories) from the dataset.
Evaluator:
- A prompt-engineered LLM.
- Prompts differ by question task:
  - temporal reasoning, knowledge update, single-session preference, etc.
- It reports the quality of the target assistant's answer as a score.
How to evaluate the Evaluator?
- A human expert decides by checking the triple (question, target assistant's answer, prompt-engineered LLM's verdict).
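Below is a minimal sketch of this LLM-as-a-judge step, assuming a `call_llm` helper that wraps some chat-completion API; both the helper and the prompt templates are hypothetical placeholders, not the actual prompts used:

```python
# Hypothetical per-task judge prompts; the real prompts are task-specific
# and more elaborate than these placeholders.
JUDGE_PROMPTS = {
    "temporal-reasoning": (
        "Question: {q}\nGold answer: {a}\nAssistant answer: {pred}\n"
        "Considering dates and event ordering, is the assistant's answer "
        "correct? Reply with 1 (correct) or 0 (incorrect)."
    ),
    "knowledge-update": (
        "Question: {q}\nGold answer: {a}\nAssistant answer: {pred}\n"
        "The user's information may have changed over time. Does the answer "
        "reflect the latest state? Reply with 1 (yes) or 0 (no)."
    ),
}

def judge(task: str, q: str, a: str, pred: str, call_llm) -> int:
    """Score the target assistant's answer with a prompt-engineered LLM.

    `call_llm` is assumed to take a prompt string and return the judge
    model's text reply ("1" or "0").
    """
    prompt = JUDGE_PROMPTS[task].format(q=q, a=a, pred=pred)
    return int(call_llm(prompt).strip())
```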
b. Memory Recall
Evaluates retrieval performance.
How?
- The evaluation dataset pairs each question with the evidence sessions needed to answer it (see section 3), so retrieval is scored by whether the memory system recalls those evidence sessions.
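Assuming each question is annotated with the IDs of its evidence sessions, retrieval quality can be summarized with a standard metric such as Recall@k; a minimal sketch:

```python
def recall_at_k(retrieved_ids: list[str], evidence_ids: set[str], k: int) -> float:
    """Fraction of a question's evidence sessions found in the top-k retrieved."""
    if not evidence_ids:
        return 0.0
    hits = evidence_ids & set(retrieved_ids[:k])
    return len(hits) / len(evidence_ids)

# Example: 1 of the 2 evidence sessions appears in the top 5 -> 0.5
print(recall_at_k(["s3", "s9", "s1", "s7", "s2"], {"s1", "s4"}, k=5))
```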
3. Dataset building pipeline
The main flow is:
- Make the evidence conversations (sessions containing the information needed to answer the question)
- Make the whole chat conversation around them (a sketch follows below)
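A rough sketch of this two-step flow. `generate_evidence_session` and `generate_filler_session` are hypothetical helpers standing in for LLM-driven session writers, and padding the history with unrelated filler sessions is an assumption about how the whole conversation is assembled:

```python
import random

def build_history(facts: list[str], n_total_sessions: int,
                  generate_evidence_session, generate_filler_session):
    """Assemble a whole chat history around the evidence conversations.

    Both generator callables are hypothetical placeholders.
    """
    # Step 1: make the evidence conversations that carry the needed facts.
    evidence = [generate_evidence_session(fact) for fact in facts]
    # Step 2: pad with unrelated sessions and shuffle into the whole history.
    filler = [generate_filler_session()
              for _ in range(n_total_sessions - len(evidence))]
    history = evidence + filler
    random.shuffle(history)
    return history
```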