i.galabov
  • Joined on 2024-02-13
i.galabov pushed to save_point at TEDective/etl 2024-06-24 14:02:20 +00:00
08a182bfbd Revert "Merge branch 'main' into feature/deduplication-polars"
i.galabov created branch save_point in TEDective/etl 2024-06-24 14:02:19 +00:00
i.galabov pushed to main at TEDective/docs 2024-06-24 09:59:59 +00:00
13edf53c50 Merge pull request 'deduplication-docs' (#11) from deduplication-docs into main
0650285e52 fix: forgot to add files
357f59d133 feat: Deduplication docs
i.galabov merged pull request TEDective/docs#11 2024-06-24 09:59:58 +00:00
deduplication-docs
i.galabov created pull request TEDective/docs#11 2024-06-24 09:59:33 +00:00
deduplication-docs
i.galabov pushed to deduplication-docs at TEDective/docs 2024-06-24 09:59:18 +00:00
0650285e52 fix: forgot to add files
i.galabov pushed to deduplication-docs at TEDective/docs 2024-06-24 01:18:48 +00:00
357f59d133 feat: Deduplication docs
i.galabov created branch deduplication-docs in TEDective/docs 2024-06-24 01:18:47 +00:00
i.galabov commented on pull request TEDective/etl#38 2024-06-23 18:19:30 +00:00
New deduplication

Deduplication using:

  1. TF-IDF vectorization on names of organizations
  2. Clustering using DBSCAN. This leaves us with clusters of organizations with a high enough probability to match (using…
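The two steps listed above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code from the pull request: the character trigrams, the cosine distance, the `eps` and `min_samples` values, all function names, and the sample organization names are assumptions made for the example. A real pipeline would more likely rely on an existing library implementation of TF-IDF and DBSCAN.

```python
import math
from collections import Counter


def char_ngrams(name, n=3):
    """Split a lowercased, whitespace-normalized name into character n-grams."""
    text = " ".join(name.lower().split())
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(names, n=3):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per name."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed IDF, so that no n-gram ends up with a zero weight.
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0)
               for g, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(g, 0.0) for g, w in a.items())


def dbscan(vectors, eps=0.5, min_samples=2):
    """Plain DBSCAN; returns one cluster label per vector, -1 meaning noise."""
    n = len(vectors)
    # Each point counts as its own neighbor (distance 0).
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels


# Hypothetical organization names, for illustration only.
names = [
    "Siemens AG",
    "SIEMENS AG ",
    "Siemens A.G.",
    "Deutsche Bahn AG",
    "deutsche bahn ag",
    "Red Cross",
]
labels = dbscan(tfidf_vectors(names))
for name, label in zip(names, labels):
    print(label, repr(name))
```

Names that end up with the same non-negative label are candidate duplicates, and `-1` marks names with no close match. The all-pairs distance computation is quadratic in the number of names, so this form is only suitable for small inputs.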
i.galabov pushed to feature/deduplication-polars at TEDective/etl 2024-06-23 16:52:08 +00:00
0c29cfa1db Merge branch 'main' into feature/deduplication-polars
0a35ea3f1f fix: testing
8a2c3ac69f fix: typo
97790d4cb9 fix: duplicated org id's should not be the case anymore
c2a1b3e2f8 fix: make phone extraction more robust against missing values
i.galabov commented on pull request TEDective/etl#37 2024-06-16 09:47:25 +00:00
Deduplication and switch to Polars

Regarding Polars, I need to make one more simple change. I plan to have it done by the end of Monday.

The point of using Polars is that it handles large data frames much faster than pandas. With my testing…

i.galabov created pull request TEDective/etl#37 2024-06-16 09:23:21 +00:00
feature/deduplication-polars
i.galabov created branch feature/deduplication-polars in TEDective/etl 2024-06-14 14:45:17 +00:00
i.galabov pushed to feature/deduplication-polars at TEDective/etl 2024-06-14 14:45:17 +00:00
331e2403bc Merge branch 'testing' into feature/deduplication-polars
e7b0da6bc8 fix: deleted one comment
f90bfe05ac New deduplication which will be documented. Using polars for the data manipulation
i.galabov approved TEDective/etl#35 2024-05-25 21:04:10 +00:00
Week 3/4 ETL

Solid work. I have started looking through the problems that we have in deduplication.

i.galabov commented on pull request TEDective/etl#35 2024-05-25 21:00:22 +00:00
Week 3/4 ETL

This will need to be fixed. I will get it done once I dive into deduplication.

i.galabov commented on pull request TEDective/etl#35 2024-05-25 20:58:47 +00:00
Week 3/4 ETL

Any specific reasons for these changes or just consistency?

i.galabov commented on pull request TEDective/etl#29 2024-04-18 13:58:07 +00:00
feature/new_archive_structure

Update: I never considered that in actual production you would never run only two months of 2024 with unimplemented eForms. So technically, when I run from 2023-10 to 2024-03, everything…

i.galabov pushed to feature/new_archive_structure at TEDective/etl 2024-04-18 13:14:02 +00:00
138427ee71 Merge branch 'development' into feature/new_archive_structure
a8606e11f3 Merge pull request 'Fix rates missing rates' (#31) from fix/division-by-zero-currency into development
a71c852e50 fix: handle missing rates
9900b1fc45 Merge branch 'main' into development
i.galabov commented on pull request TEDective/etl#29 2024-04-18 12:39:30 +00:00
feature/new_archive_structure

But what's very interesting is the eForms ratio. This really validates TEDective/etl#18 and this should probably be the main focus for now.

Yes, you are right.…