How to evaluate the memory of an AI assistant? (editing)

Last Updated: 2025-03-29


0. Overview

The research studies the long-term memory ability of AI chat assistants. Its contribution can be divided into two parts:

  • A guide to building a dataset and using it to evaluate the long-term memory of an AI assistant
  • Design suggestions for long-term memory, with experiments on them

Before getting started, let's check which abilities long-term memory should provide, according to the research:

  • Information Extraction
  • Multi-session Reasoning
  • Knowledge Updates
  • Temporal Reasoning
  • Abstention

Index

  1. Structure of Evaluation Dataset
  2. Evaluation Methods
  3. Dataset building pipeline
  4. long-term memory design suggestions
  5. Experiments on long-term memory

1. Structure of Evaluation Dataset

Each item of the evaluation dataset (a problem) is structured as follows:
$$ \text{problem} = (\textbf{S}, q, t_q, a) $$
$$ \textbf{S} = [(t_1, S_1), (t_2, S_2), \ldots, (t_N, S_N)] $$

  • $S_i$: the $i$-th multi-turn interaction (session) between the user and the AI assistant
  • $t_i$: timestamp of session $S_i$
  • $q$: question
  • $a$: answer
  • $t_q$: timestamp of QA session ( $t_q > t_N$ )

The answer is derived from the question and the user's interaction history.
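To make the tuple concrete, here is a minimal Python sketch of one problem. The class and field names (`Turn`, `Session`, `Problem`, `question_time`, and so on) are assumptions chosen for illustration and are not taken from the research; only the shape $(\textbf{S}, q, t_q, a)$ and the constraint $t_q > t_N$ come from the definition above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Turn:
    role: str      # "user" or "assistant" (hypothetical labels)
    content: str


@dataclass
class Session:
    timestamp: datetime   # t_i
    turns: list[Turn]     # S_i, one multi-turn interaction


@dataclass
class Problem:
    sessions: list[Session]   # S = [(t_1, S_1), ..., (t_N, S_N)]
    question: str             # q
    question_time: datetime   # t_q
    answer: str               # a

    def __post_init__(self) -> None:
        # Enforce t_q > t_N: the question must come after every session.
        if any(s.timestamp >= self.question_time for s in self.sessions):
            raise ValueError("question_time must be later than all session timestamps")


problem = Problem(
    sessions=[
        Session(datetime(2024, 1, 5), [Turn("user", "I adopted a cat named Miso.")]),
        Session(datetime(2024, 1, 20), [Turn("user", "Can you suggest a cat toy?")]),
    ],
    question="What is my cat's name?",
    question_time=datetime(2024, 2, 1),
    answer="Miso",
)
```

Checking the timestamp constraint at construction time keeps malformed problems out of the dataset early, which is a design choice of this sketch rather than something the research prescribes.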

2. Evaluation Methods

Two main methods:

  • Question-Answering
  • Memory Recall

a. Question Answering

Main idea: "Is the AI assistant's answer good for the given question?"

Target of the evaluation:

  • The AI assistant, given the memories from the dataset.

Evaluator:

  • A prompt-engineered LLM.
  • The prompt differs by question task
    • temporal-reasoning, knowledge-update, single-session-preference, etc.
  • Outputs the quality of the target assistant's answer as a score.
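The evaluator described above can be sketched roughly as follows. Everything here is an assumption for illustration: the prompt wording, the `TASK_PROMPTS` table, the function names, and the yes/no scoring are hypothetical and are not the prompts or scale used in the research. The judge LLM is passed in as a plain callable so that no particular API is implied.

```python
from typing import Callable

# Hypothetical per-task instructions; the wording is illustrative only.
TASK_PROMPTS = {
    "temporal-reasoning": "Check that dates and durations in the response agree with the reference.",
    "knowledge-update": "The response must reflect the most recent information in the reference.",
    "single-session-preference": "Check that the response respects the user preference in the reference.",
}
DEFAULT_INSTRUCTION = "Check that the response contains the reference answer."

TEMPLATE = (
    "{instruction}\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Assistant response: {response}\n"
    "Is the response correct? Answer yes or no."
)


def build_judge_prompt(task: str, question: str, reference: str, response: str) -> str:
    """Pick the task-specific instruction and fill in the judge prompt."""
    instruction = TASK_PROMPTS.get(task, DEFAULT_INSTRUCTION)
    return TEMPLATE.format(
        instruction=instruction,
        question=question,
        reference=reference,
        response=response,
    )


def score_answer(
    judge: Callable[[str], str],
    task: str,
    question: str,
    reference: str,
    response: str,
) -> int:
    """Return 1 if the judge's reply starts with 'yes', else 0 (assumed binary scale)."""
    reply = judge(build_judge_prompt(task, question, reference, response))
    return 1 if reply.strip().lower().startswith("yes") else 0
```

In use, `judge` would wrap whatever LLM client is available, for example `score_answer(my_llm_call, "knowledge-update", q, a, assistant_reply)`, where `my_llm_call` is a hypothetical function that sends a prompt and returns the model's text.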

How to evaluate the Evaluator?

  • A human expert checks each (question, target assistant's answer, evaluator LLM's judgment) triple and decides whether the judgment is correct.
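One simple way to summarize that human check is an agreement rate between the expert's labels and the evaluator's labels. The sketch below assumes both are recorded as 0/1 per question; the function name and the label encoding are hypothetical, and the research may report agreement differently.

```python
def agreement_rate(human_labels: list[int], judge_labels: list[int]) -> float:
    """Fraction of items where the evaluator LLM's label matches the human expert's."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("at least one labeled item is required")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


# Example: the evaluator disagrees with the expert on one of four items.
rate = agreement_rate([1, 0, 1, 1], [1, 0, 0, 1])
```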

b. Memory Recall

Evaluates retrieval performance.
How?

  • The evaluation dataset has Question &
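A common way to score retrieval is recall@k. The sketch below is a generic version of that metric, not necessarily the one used in the research. It assumes (this is not stated here) that each question comes with the IDs of the sessions holding its evidence and that the memory system returns a ranked list of session IDs; the function and parameter names are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], evidence_ids: set[str], k: int) -> float:
    """Share of evidence sessions that appear in the top-k retrieved sessions."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not evidence_ids:
        raise ValueError("at least one evidence session is required")
    top_k = set(retrieved_ids[:k])
    return len(top_k & evidence_ids) / len(evidence_ids)


# Example: two evidence sessions, only one of them ranked in the top 2.
score = recall_at_k(["s3", "s1", "s7", "s2"], {"s1", "s2"}, k=2)
```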

3. Dataset building pipeline

The main flow is:

  1. Make the evidence conversation
  2. Make the whole chat conversation

4. long-term memory design suggestions

5. Experiments on long-term memory