Human-Like Memory Systems

A paper and codebase introducing RoomEnv-v0 and heuristic agents with explicit episodic and semantic memory.

Overview

Why explicit episodic and semantic memory matter for AI agents.

Authors: Taewoon Kim, Michael Cochez, Vincent Francois-Lavet, Mark Neerincx, and Piek Vossen.

Inspired by cognitive science theory, we explicitly model an agent with both semantic and episodic memory systems and ask whether that combination is better than relying on only one of the two. To make the question concrete, we introduce RoomEnv-v0, a challenging OpenAI-Gym-compatible environment in which an agent must encode, store, retrieve, and use memories to answer object-location questions over time. The public RoomEnv repository is therefore part of the contribution, alongside the agent repository for this project.

The benchmark also allows multiple agents to collaborate, so we study hybrid intelligence in addition to single-agent memory. The main findings are that mixed episodic-plus-semantic memory outperforms simpler baselines, pretrained commonsense helps further, and two agents collaborating can outperform one agent acting alone with the same total memory budget.

In cognitive science, explicit human memory is commonly discussed as a combination of semantic memory and episodic memory. Here we turn that distinction into an AI design problem: if an agent stores both kinds of memory explicitly, does it answer questions better under partial observability? Our three main contributions are to model mixed memory explicitly, introduce RoomEnv-v0 as the benchmark, and show that collaborative settings can improve results too. Read the full paper on arXiv.

Benchmark

RoomEnv-v0 tests memory under partial observability.

The OpenAI-Gym-compatible Room environment is one large room with $N_{\text{people}}$ people, $N_{\text{objects}}$ objects, $N_{\text{locations}}$ locations, and $N_{\text{agents}}$ agents. A human places an object at some location, but each agent can observe only one such placement event at a time. At the same step, the environment asks a question about an object's location, so the agent must answer from memory rather than from privileged access to the full room state.

Observations are represented as RDF-like quadruples and questions as structured relation queries:

$$\mathbf{x}^{(t)} = (\mathbf{h}^{(t)}, \mathbf{r}^{(t)}, \mathbf{t}^{(t)}, t), \qquad \mathbf{q}^{(t)} = (\mathbf{h}, \mathbf{r}).$$

For example, $\texttt{(laptop, at\_location, desk, 42)}$ records a specific observation, while $\texttt{(cat, at\_location)}$ asks where the cat is. A correct answer yields a reward of $+1$ and an incorrect answer yields $0$.
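The quadruple and query formats above can be sketched as plain data types. This is a minimal illustration only: the class names `Observation` and `Question` and the `reward` helper are hypothetical and are not taken from the RoomEnv codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One observed placement event: (head, relation, tail, timestamp)."""
    head: str
    relation: str
    tail: str
    timestamp: int


@dataclass(frozen=True)
class Question:
    """A query (head, relation); the expected answer is the missing tail."""
    head: str
    relation: str


def reward(answer: str, true_tail: str) -> int:
    """Return 1 for a correct answer and 0 otherwise."""
    return int(answer == true_tail)


# The two examples from the text, written with the hypothetical types.
obs = Observation("laptop", "at_location", "desk", 42)
question = Question("cat", "at_location")
```

Freezing the dataclasses keeps each record immutable and hashable, which is convenient if memories are later stored in sets or used as dictionary keys.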

The environment remains intentionally dynamic. At every step, several Bernoulli-controlled changes can happen:

with probability $p_{\text{commonsense}}$, an object is placed in a commonsense location sourced from ConceptNet
with probability $p_{\text{new\_location}}$, a human changes an object's location
with probability $p_{\text{new\_object}}$, a human changes which object they carry
with probability $p_{\text{switch\_person}}$, two people swap locations to mimic movement through the room

With one agent, the setup can be summarized as

$$S_t = (\mathbf{x}^{(t)}, \mathbf{q}^{(t)}), \qquad A_t = (\text{memory operation}, \text{answer}), \qquad R_t \in \{0, 1\}.$$
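As a rough sketch of this state, action, and reward structure, the loop below sums per-step rewards for an arbitrary policy. The names `State`, `Action`, `run_episode`, and `always_desk`, and the `"store_episodic"` label, are assumptions made for illustration; they do not reflect the actual RoomEnv-v0 API.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class State(NamedTuple):
    """S_t: the current observation x_t and question q_t."""
    observation: Tuple[str, str, str, int]  # (head, relation, tail, timestamp)
    question: Tuple[str, str]               # (head, relation)


class Action(NamedTuple):
    """A_t: a memory operation plus an answer to the question."""
    memory_operation: str   # hypothetical label, e.g. "store_episodic"
    answer: Optional[str]


def run_episode(
    states: List[State],
    true_answers: List[str],
    policy: Callable[[State], Action],
) -> int:
    """Sum the per-step rewards R_t in {0, 1} over one episode."""
    total_reward = 0
    for state, truth in zip(states, true_answers):
        action = policy(state)
        total_reward += int(action.answer == truth)
    return total_reward


def always_desk(state: State) -> Action:
    """Trivial placeholder policy: always answers 'desk'."""
    return Action("store_episodic", "desk")


states = [
    State(("laptop", "at_location", "desk", 0), ("laptop", "at_location")),
    State(("cat", "at_location", "sofa", 1), ("cat", "at_location")),
]
total = run_episode(states, ["desk", "sofa"], always_desk)
```

Here the placeholder policy answers the first question correctly and the second incorrectly, so `total` is 1. A memory-based policy would replace `always_desk` while the loop stays the same.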

Each agent maintains bounded episodic and semantic memory stores. Episodic memory keeps person-specific events with timestamps, while semantic memory compresses repeated experience into generalized world knowledge with strengths instead of timestamps. We compare four handcrafted policies under equal memory budgets: episodic only, semantic only, both episodic and semantic, and both with pretrained ConceptNet commonsense knowledge. We also evaluate a multi-agent setting where memories can be combined across agents.
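The two bounded stores might look roughly like the sketch below. The text states only that episodic entries carry timestamps and semantic entries carry strengths; the eviction rules (drop the oldest event, drop the weakest triple) and the answering rules (latest event, strongest triple) are assumptions chosen for illustration, as are the class and method names.

```python
from typing import Dict, List, Optional, Tuple


class EpisodicMemory:
    """Bounded store of timestamped events (head, relation, tail, timestamp)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[Tuple[str, str, str, int]] = []

    def add(self, head: str, relation: str, tail: str, timestamp: int) -> None:
        self.entries.append((head, relation, tail, timestamp))
        if len(self.entries) > self.capacity:
            # Assumed rule: evict the entry with the oldest timestamp.
            self.entries.remove(min(self.entries, key=lambda e: e[3]))

    def answer(self, head: str, relation: str) -> Optional[str]:
        # Assumed rule: answer with the tail of the most recent matching event.
        matches = [e for e in self.entries if e[:2] == (head, relation)]
        return max(matches, key=lambda e: e[3])[2] if matches else None


class SemanticMemory:
    """Bounded store of generalized triples with a strength count each."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.strength: Dict[Tuple[str, str, str], int] = {}

    def add(self, head: str, relation: str, tail: str) -> None:
        key = (head, relation, tail)
        self.strength[key] = self.strength.get(key, 0) + 1
        if len(self.strength) > self.capacity:
            # Assumed rule: evict the triple with the lowest strength.
            weakest = min(self.strength, key=lambda k: self.strength[k])
            del self.strength[weakest]

    def answer(self, head: str, relation: str) -> Optional[str]:
        # Assumed rule: answer with the tail of the strongest matching triple.
        matches = [k for k in self.strength if k[:2] == (head, relation)]
        return max(matches, key=lambda k: self.strength[k])[2] if matches else None
```

Repeated observations of the same triple raise its strength rather than consuming a new slot, which is one simple way to model compression of repeated experience into general knowledge.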

To simplify the benchmark, the experiments use a restricted subset of ConceptNet. The setup fixes 10 objects, 10 random human names, a single relation AtLocation, and a maximum episode length of 1,000 steps. The main environment probabilities are:

$p_{\text{commonsense}} = 0.7$
$p_{\text{new\_location}} = 0.1$
$p_{\text{new\_object}} = 0.1$
$p_{\text{switch\_person}} = 0.5$
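To make the Bernoulli-controlled dynamics concrete, the sketch below draws one outcome per event type using the probabilities listed above. Treating the four draws as independent and drawing once per step are simplifying assumptions, and `sample_step_events` is a hypothetical helper rather than part of RoomEnv-v0.

```python
import random
from typing import Dict

# Probabilities as listed in the text.
PROBABILITIES: Dict[str, float] = {
    "commonsense": 0.7,
    "new_location": 0.1,
    "new_object": 0.1,
    "switch_person": 0.5,
}


def sample_step_events(rng: random.Random) -> Dict[str, bool]:
    """Draw one Bernoulli outcome per event type (assumed independent)."""
    return {name: rng.random() < p for name, p in PROBABILITIES.items()}


# Empirical event rates over 1,000 simulated steps, the maximum episode length.
rng = random.Random(0)
draws = [sample_step_events(rng) for _ in range(1000)]
rates = {
    name: sum(d[name] for d in draws) / len(draws) for name in PROBABILITIES
}
```

Over a full episode, the empirical rates in `rates` should land close to the configured probabilities, which is a quick sanity check on any reimplementation of the dynamics.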

Results

Mixed memory and collaboration deliver the strongest results.

The first result is straightforward but important: structured forgetting and retrieval policies beat random baselines. If the agent both forgets memories and answers questions uniformly at random, performance collapses. Once forgetting and retrieval follow coherent handcrafted rules, even heuristic agents become much stronger baselines than their random counterparts.

Figure (episodic result): Handcrafted vs. random policies for the four explicit-memory baselines.

Figure (semantic result): Handcrafted vs. random policies for the four explicit-memory baselines.

When memory capacity is small, episodic-only memory can perform better because there is not enough space to learn stable general world knowledge. As capacity increases, semantic memory becomes increasingly useful because the agent can generalize across many observations instead of treating every event as isolated.

The most interesting case is the agent with pretrained semantic memory. Because the general world knowledge is already present, the agent can focus more of its finite capacity on episodic recall, which yields the strongest overall results in this setup. The collaboration result is similarly important: two agents sharing memories can outperform a single agent with the same total budget because they cover different parts of the room and contribute complementary recall.
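One simple way to picture memory sharing is to pool the episodic entries of all agents before answering. The text says only that memories can be combined across agents, so the pooling rule and the most-recent-match answer rule below are assumptions, and `answer_from_shared_memory` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, str, str, int]  # (head, relation, tail, timestamp)


def answer_from_shared_memory(
    memories: List[List[Event]], head: str, relation: str
) -> Optional[str]:
    """Pool every agent's entries and answer with the most recent match."""
    pooled = [event for memory in memories for event in memory]
    matches = [e for e in pooled if e[0] == head and e[1] == relation]
    if not matches:
        return None
    return max(matches, key=lambda e: e[3])[2]


# Two agents that observed different events in the same room.
agent_a = [("laptop", "at_location", "desk", 3)]
agent_b = [
    ("cat", "at_location", "sofa", 5),
    ("laptop", "at_location", "shelf", 8),
]
```

In this toy case, agent A alone cannot answer a question about the cat, while the pooled memory can, and the pooled memory also prefers agent B's more recent sighting of the laptop.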

Best strategies across memory capacities

Total rewards with respect to different handcrafted policies and memory capacities.

Single-agent versus two-agent performance

Total rewards with respect to the number of agents. Collaboration improves recall and answer quality even under a fixed total memory budget.

Takeaways

Explicit episodic and semantic memory provide the foundation for the later work.

Theoretically, this work is closest to cognitive-science traditions such as ACT-R and Soar. Those systems provide strong conceptual motivation, but they do not provide the same kind of computational benchmark surface used here. Human-Like Memory Systems turns that theory into something experimentally comparable.

On the machine-learning side, some work studies memory and question answering computationally, but often focuses on episodic memory alone or stores memory in opaque numeric embeddings rather than structured symbolic records. Our contribution is to keep the memory systems explicit, interpretable, and directly comparable while still evaluating them in a nontrivial partially observable task.

The main takeaway is that explicit episodic and semantic memory systems are useful architectural primitives for AI agents operating under partial observability. Mixed memory outperforms single-memory baselines, commonsense pretraining helps, and collaboration between agents can further improve question answering. That makes Human-Like Memory Systems the benchmark-and-agents starting point for the later Explicit Memory and RoomKG work.

Resources

Paper, GitHub, and project links.

Read paper
View GitHub (Agent)
View GitHub (RoomEnv)
Parent project

Acknowledgements

Project support.

This research was partially funded by the Hybrid Intelligence Center, a 10-year program funded by the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation for Scientific Research (NWO). Learn more at Hybrid Intelligence.

Cite

Cite our paper.

@misc{kim2026machinehumanlikememorysystems,
      title={A Machine With Human-Like Memory Systems}, 
      author={Taewoon Kim and Michael Cochez and Vincent Francois-Lavet and Mark Neerincx and Piek Vossen},
      year={2022},
      eprint={2204.01611},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2204.01611}, 
}