How to Evaluate the Memory of an AI Assistant? (editing)
Last Updated: 2025-03-29
0. Overview
The research is about the long-term memory ability of AI chat assistants. Its contributions can be divided into two parts:
- Guidance on how to build a dataset and use it to evaluate the long-term memory of an AI assistant
- A suggested design for long-term memory, with experiments on it
Before getting started, let's check which abilities of long-term memory the research evaluates:
- Information Extraction
- Multi-session Reasoning
- Knowledge Updates
- Temporal Reasoning
- Abstention
Index
- Structure of Evaluation Dataset
- Evaluation Methods
- Dataset building pipeline
- Long-term memory design suggestions
- Experiments on long-term memory
1. Structure of Evaluation Dataset
Each item of the evaluation dataset, i.e., a problem, is structured as below:
$$
\text{problem} = (\textbf{S}, q, t_q, a) \\
\textbf{S} = [(t_1, S_1), (t_2, S_2), \dots, (t_N, S_N)]
$$
- $S_i$: a multi-turn interaction session between the user and the AI assistant, with timestamp $t_i$
- $q$: question
- $a$: answer
- $t_q$: timestamp of QA session ( $t_q > t_N$ )
The assistant is expected to derive the answer $a$ from the question $q$ and the user's interaction history $\textbf{S}$.
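As a concrete picture of this structure, here is a minimal sketch in Python. The class and field names are assumptions for illustration, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Session:
    # One (t_i, S_i) pair: a timestamped multi-turn session.
    timestamp: str        # t_i
    turns: list[dict]     # e.g. [{"role": "user", "content": "..."}, ...]

@dataclass
class Problem:
    # problem = (S, q, t_q, a)
    history: list[Session]  # S = [(t_1, S_1), ..., (t_N, S_N)]
    question: str           # q
    question_time: str      # t_q, later than every t_i in the history
    answer: str             # a, the gold answer
```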
2. Evaluation Methods
Two main methods:
- Question-Answering
- Memory Recall
a. Question Answering
Main idea: "Is the AI assistant's answer good for the given question?"
Target of the evaluation:
- The AI assistant, equipped with the interaction history (memories) from the dataset.
Evaluator:
- A prompt-engineered LLM.
- Prompts differ by question task:
  - temporal reasoning, knowledge update, single-session preference, etc.
- It reports the quality of the target assistant's answer as a score.
How to evaluate the Evaluator?
- A human expert decides by checking the triple (question, target assistant's answer, prompt-engineered LLM's verdict).
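Below is a minimal sketch of this LLM-as-a-judge step, assuming a `call_llm` helper that wraps some chat-completion API; both the helper and the prompt templates are hypothetical placeholders, not the actual prompts used:

```python
# Hypothetical per-task judge prompts; the real prompts are task-specific
# and more elaborate than these placeholders.
JUDGE_PROMPTS = {
    "temporal-reasoning": (
        "Question: {q}\nGold answer: {a}\nAssistant answer: {pred}\n"
        "Considering dates and event ordering, is the assistant's answer "
        "correct? Reply with 1 (correct) or 0 (incorrect)."
    ),
    "knowledge-update": (
        "Question: {q}\nGold answer: {a}\nAssistant answer: {pred}\n"
        "The user's information may have changed over time. Does the answer "
        "reflect the latest state? Reply with 1 (yes) or 0 (no)."
    ),
}

def judge(task: str, q: str, a: str, pred: str, call_llm) -> int:
    """Score the target assistant's answer with a prompt-engineered LLM.

    `call_llm` is assumed to take a prompt string and return the judge
    model's text reply ("1" or "0").
    """
    prompt = JUDGE_PROMPTS[task].format(q=q, a=a, pred=pred)
    return int(call_llm(prompt).strip())
```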
b. Memory Recall
Evaluates retrieval performance.
How?
- The evaluation dataset pairs each question with the evidence sessions needed to answer it (see section 3), so retrieval is scored by whether the memory system recalls those evidence sessions.
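Assuming each question is annotated with the IDs of its evidence sessions, retrieval quality can be summarized with a standard metric such as Recall@k; a minimal sketch:

```python
def recall_at_k(retrieved_ids: list[str], evidence_ids: set[str], k: int) -> float:
    """Fraction of a question's evidence sessions found in the top-k retrieved."""
    if not evidence_ids:
        return 0.0
    hits = evidence_ids & set(retrieved_ids[:k])
    return len(hits) / len(evidence_ids)

# Example: 1 of the 2 evidence sessions appears in the top 5 -> 0.5
print(recall_at_k(["s3", "s9", "s1", "s7", "s2"], {"s1", "s4"}, k=5))
```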
3. Dataset building pipeline
The main flow is:
- Make the evidence conversations (sessions containing the information needed to answer the question)
- Make the whole chat conversation around them (a sketch follows below)
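A rough sketch of this two-step flow. `generate_evidence_session` and `generate_filler_session` are hypothetical helpers standing in for LLM-driven session writers, and padding the history with unrelated filler sessions is an assumption about how the whole conversation is assembled:

```python
import random

def build_history(facts: list[str], n_total_sessions: int,
                  generate_evidence_session, generate_filler_session):
    """Assemble a whole chat history around the evidence conversations.

    Both generator callables are hypothetical placeholders.
    """
    # Step 1: make the evidence conversations that carry the needed facts.
    evidence = [generate_evidence_session(fact) for fact in facts]
    # Step 2: pad with unrelated sessions and shuffle into the whole history.
    filler = [generate_filler_session()
              for _ in range(n_total_sessions - len(evidence))]
    history = evidence + filler
    random.shuffle(history)
    return history
```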