5.14 The find_near_duplicates tool

Read-only diagnostic for memory cleanup. Returns clusters of active memories whose embeddings have a cosine similarity of at least threshold. Clusters are formed by union-find over qualifying pairs, so transitively similar rows (a~b and b~c) land in the same group even if a and c do not themselves meet the threshold. Clusters are sorted strongest-first by the maximum pair similarity within each cluster. Soft-deleted and expired rows are excluded.
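The clustering step can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function name cluster_pairs and the (id_a, id_b, similarity) tuple shape are assumptions, and the input is assumed to list each unordered pair once.

```python
from collections import defaultdict

def cluster_pairs(pairs, max_groups=50):
    """Group qualifying (id_a, id_b, similarity) pairs with union-find.

    Hypothetical helper: assumes each unordered pair appears once and
    already meets the similarity threshold.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # First pass: merge the two endpoints of every qualifying pair.
    for a, b, _sim in pairs:
        union(a, b)

    # Second pass: collect per-cluster ids, strongest pair, and pair count.
    groups = defaultdict(lambda: {"ids": set(), "max_similarity": 0.0, "pair_count": 0})
    for a, b, sim in pairs:
        group = groups[find(a)]
        group["ids"].update((a, b))
        group["max_similarity"] = max(group["max_similarity"], sim)
        group["pair_count"] += 1

    # Strongest-first ordering, then truncate to the requested number of groups.
    ordered = sorted(groups.values(), key=lambda g: g["max_similarity"], reverse=True)
    return [
        {"ids": sorted(g["ids"]), "max_similarity": g["max_similarity"], "pair_count": g["pair_count"]}
        for g in ordered[:max_groups]
    ]

# Rows 1, 2, 3 are linked transitively (1~2, 2~3); rows 4 and 5 form the strongest pair.
result = cluster_pairs([(1, 2, 0.97), (2, 3, 0.93), (4, 5, 0.984)])
print([g["ids"] for g in result])  # [[4, 5], [1, 2, 3]]
```

Rows 1 and 3 end up in one cluster here without ever being compared directly, which is the transitive behavior described above.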

Pair candidates are looked up via pgvector’s HNSW index using a fixed top-10 nearest neighbors per row, so each row contributes at most 10 candidates. This is more than enough at the cutoffs typical for cleanup (threshold >= 0.85).
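To make the effect of the per-row cap concrete, here is a brute-force stand-in for the candidate lookup in plain Python. It does not use pgvector or HNSW; it only mimics "take each row's nearest neighbors, keep those at or above threshold, deduplicate unordered pairs". The names cosine and candidate_pairs are hypothetical, and vectors are assumed to be non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def candidate_pairs(embeddings, threshold=0.92, top_k=10):
    """embeddings: dict of id -> vector. Returns sorted, deduplicated (a, b, sim) tuples.

    Exhaustive search used as a stand-in for an index lookup (assumption).
    """
    best = {}
    for a, vec_a in embeddings.items():
        # Score every other row, then keep only this row's top_k neighbors.
        scored = sorted(
            ((cosine(vec_a, vec_b), b) for b, vec_b in embeddings.items() if b != a),
            reverse=True,
        )
        for sim, b in scored[:top_k]:
            if sim >= threshold:
                best[(min(a, b), max(a, b))] = sim  # (a, b) and (b, a) collapse to one key
    return [(a, b, sim) for (a, b), sim in sorted(best.items())]

rows = {1: [1.0, 0.0], 2: [0.99, 0.05], 3: [0.0, 1.0]}
pairs = candidate_pairs(rows, threshold=0.92)
print([(a, b) for a, b, _sim in pairs])  # [(1, 2)]
```

With a cap of 10 per row, a pair can only be missed when a row has more than ten neighbors at or above the threshold, which is unusual at high cutoffs.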

5.14.1 Inputs

threshold

Number, 0.5–1.0. Optional, default 0.92. Cosine-similarity cutoff for pair candidates. 0.95+ is typical for “effectively identical” rows.

max_groups

Integer. Optional, default 50, maximum 500. Upper limit on the number of clusters returned.

5.14.2 Output

{
  "ok": true,
  "threshold": 0.92,
  "groups": [
    {
      "ids": [4, 5],
      "max_similarity": 0.984,
      "pair_count": 1,
      "memories": [
        { "id": 4, "text": "...", "tags": ["family"], ... },
        { "id": 5, "text": "...", "tags": ["family"], ... }
      ]
    }
  ],
  "summary": { "groups": 5, "pairs": 8 }
}

In each group, pair_count is the number of qualifying pairs inside that cluster. It is always at least 1, and a higher value means the cluster is denser. In the summary, pairs is the total number of qualifying pairs across all returned clusters.
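A caller might recompute these figures from the groups array as in the sketch below. It rests on two assumptions not stated outright above: that summary.groups equals the number of returned clusters, and that summary.pairs equals the sum of the per-group pair_count values. The density helper is purely illustrative; it relates pair_count to the n(n-1)/2 pairs possible among n ids.

```python
def summarize(groups):
    """Rebuild the summary object from the returned groups (assumed semantics)."""
    return {"groups": len(groups), "pairs": sum(g["pair_count"] for g in groups)}

def density(group):
    """Fraction of all possible pairs in the cluster that met the threshold.

    Hypothetical helper: a cluster always has at least two ids, so the
    denominator is never zero.
    """
    n = len(group["ids"])
    return group["pair_count"] / (n * (n - 1) / 2)

groups = [
    {"ids": [4, 5], "max_similarity": 0.984, "pair_count": 1},
    {"ids": [1, 2, 3], "max_similarity": 0.97, "pair_count": 2},
]
print(summarize(groups))   # {'groups': 2, 'pairs': 3}
print(density(groups[0]))  # 1.0
```

A density of 1.0 means every pair in the cluster qualifies on its own. A lower value means some members are linked only through intermediate rows and deserve a closer look before merging.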

5.14.3 Errors

INVALID_INPUT (out-of-range threshold or max_groups), DATABASE_ERROR.
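The input checks behind INVALID_INPUT might look like the following sketch. Several details are assumptions rather than documented behavior: the InvalidInput class and validate_inputs function are hypothetical, the threshold bounds are treated as inclusive, max_groups is given a lower bound of 1, and values above 500 are rejected rather than clamped.

```python
class InvalidInput(ValueError):
    """Hypothetical stand-in for the tool's INVALID_INPUT error."""
    code = "INVALID_INPUT"

def validate_inputs(threshold=0.92, max_groups=50):
    """Return normalized (threshold, max_groups) or raise InvalidInput."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise InvalidInput("threshold must be a number")
    if not 0.5 <= threshold <= 1.0:  # inclusive bounds are an assumption
        raise InvalidInput("threshold must be between 0.5 and 1.0")
    if isinstance(max_groups, bool) or not isinstance(max_groups, int):
        raise InvalidInput("max_groups must be an integer")
    if not 1 <= max_groups <= 500:  # lower bound of 1 is an assumption
        raise InvalidInput("max_groups must be between 1 and 500")
    return float(threshold), max_groups

print(validate_inputs())  # (0.92, 50)
try:
    validate_inputs(threshold=0.4)
except InvalidInput as err:
    print(err.code)  # INVALID_INPUT
```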