Improve, understand and document deduplication #20

Open
opened 2024-02-20 12:16:42 +00:00 by linus · 1 comment
Owner

Currently, the organization deduplication works but can surely be improved. It would also be important to document what happens at each step, both conceptually (in the docs) and technically (via code comments).

linus added the Kind/Enhancement, Priority Medium labels 2024-02-20 12:16:42 +00:00
Author
Owner

Also, I think it would be great if we could do the following:

  • calculate similarity scores between orgs (we already do that)
  • set a sensible threshold to identify a cluster (i.e. org nodes that are all linked by SIMILAR_TO edges with probability > threshold).
  • choose a leader node, i.e. one node that represents the cluster.
  • redirect all outgoing/incoming edges from the nodes in the cluster to the chosen leader node (should be relatively simple with parquet/polars). The SIMILAR_TO edges should be preserved, however, so that the user can see which entities have been clustered.
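The steps above could look roughly like the following. This is only a minimal sketch in plain Python, not the current ETL code. It makes several assumptions: a "cluster" is taken to be a connected component of SIMILAR_TO edges above the threshold, the leader is simply the node with the smallest id, and the node ids, edge types (`AWARDED`, `CONTRACTS`) and function names are hypothetical. The same mapping could presumably be applied with joins in polars.

```python
def find_clusters(similar_edges, threshold):
    """Map each clustered org node to its leader.

    Assumption: a cluster is a connected component of SIMILAR_TO edges
    whose probability exceeds the threshold (union-find over those edges).
    Assumption: the leader is the smallest node id in the component.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for src, dst, probability in similar_edges:
        if probability > threshold:
            root_a, root_b = find(src), find(dst)
            if root_a != root_b:
                # Attach the larger root to the smaller one, so the root of
                # every component is always its smallest id (the leader).
                leader, other = sorted((root_a, root_b))
                parent[other] = leader

    return {node: find(node) for node in list(parent)}


def redirect_edges(edges, leader_of):
    """Point all non-SIMILAR_TO edges at the cluster leader.

    SIMILAR_TO edges are kept untouched so the clustering stays visible.
    """
    redirected = []
    for src, dst, edge_type in edges:
        if edge_type == "SIMILAR_TO":
            redirected.append((src, dst, edge_type))
        else:
            redirected.append(
                (leader_of.get(src, src), leader_of.get(dst, dst), edge_type)
            )
    return redirected


# Hypothetical example data: (source, target, probability)
similar = [
    ("org:a", "org:b", 0.95),
    ("org:b", "org:c", 0.91),
    ("org:c", "org:d", 0.40),  # below threshold, org:d stays on its own
]
leader_of = find_clusters(similar, threshold=0.9)

# Hypothetical example data: (source, target, edge type)
edges = [
    ("org:b", "tender:1", "AWARDED"),
    ("buyer:9", "org:c", "CONTRACTS"),
    ("org:a", "org:b", "SIMILAR_TO"),
    ("org:d", "tender:2", "AWARDED"),
]
redirected = redirect_edges(edges, leader_of)
```

Choosing the smallest id as leader is only there to make the result deterministic; a real rule (e.g. most complete record) would need to be decided separately.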
Reference: TEDective/etl#20