New deduplication
Deduplication using:
- TF-IDF vectorization on names of organizations
- Clustering using DBSCAN. This leaves us with clusters of organizations with a high enough probability to match (using…
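To make the two steps above concrete, here is a minimal sketch of how TF-IDF vectors of organization names could be fed into DBSCAN. It is not the code from the branch: the use of scikit-learn, the character n-gram analyzer, the cosine metric, the `eps` and `min_samples` values, and the helper name `cluster_names` are all assumptions chosen for illustration.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_names(names, eps=0.3, min_samples=2):
    """Return one DBSCAN label per name; -1 means no likely duplicate was found.

    Hypothetical helper: the parameter values are illustrative, not tuned.
    """
    # Assumption: character n-grams, so small spelling differences still overlap.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3), lowercase=True)
    tfidf = vectorizer.fit_transform(names)

    # Assumption: cosine distance on the sparse TF-IDF matrix.
    # Names closer than `eps` to each other end up in the same cluster.
    clustering = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    return clustering.fit_predict(tfidf).tolist()


names = ["Deutsche Bahn AG", "deutsche bahn ag", "Ministry of Finance"]
for name, label in zip(names, cluster_names(names)):
    print(label, name)
```

Names that share a label are candidate duplicates. They would still need a second check before merging, since DBSCAN only groups names that are close in TF-IDF space.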
Deduplication and switch to Polars
As for Polars, I still need to make one more simple change. I plan to have it done by the end of Monday.
The point of switching to Polars is that it handles large data frames much faster than pandas. With my testing…
feature/deduplication-polars
feature/new_archive_structure
Update: I had not considered that in actual production you would never run only two months of 2024 while eForms are still unimplemented. So technically, when I run from 2023-10 to 2024-03, everything…
feature/new_archive_structure
What is very interesting, though, is the eForms ratio. It really validates TEDective/etl#18, which should probably be the main focus for now.
Yes, you are right.…