Address de-duplication using iterative k-core graph decomposition
2024
A de-duplicated and complete address catalog is essential for any application or business that manages large volumes of address data, such as delivery logistics, first-responder services, and government databases. To create a catalog, address data is usually procured from disparate sources, which often vary in quality and coverage and introduce duplicates or variations of the same physical address. Address de-duplication is therefore a crucial step in creating a clean and unified address catalog. De-duplication is even more challenging at a global scale because address writing styles are diverse: many regions lack standardized addressing systems, and addresses can be multilingual. In this paper, we formulate address de-duplication as an unsupervised graph clustering problem and propose SANGAM, a novel adaptation of the k-core graph decomposition algorithm. We evaluate this solution on diverse geographic regions around the world. In comparison to existing methods, we observe improvements on the F-beta measure for three datasets. Our key contributions are: (1) formulating address de-duplication as a graph clustering problem, (2) proposing SANGAM, a robust and generic de-duplication approach, (3) validating its effectiveness on diverse geographies across three continents (the Americas, Africa, and Europe), and (4) deploying our solution and showing its positive impact on geocode learning, an essential application of address de-duplication.
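To make the graph-clustering formulation concrete, the following is a minimal sketch of clustering addresses by repeatedly extracting k-cores from a similarity graph. It is not the SANGAM algorithm, whose details the abstract does not give. The token-level Jaccard similarity, the 0.5 edge threshold, the descending-k peeling schedule, and all function names are assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-level Jaccard similarity (an assumed, deliberately simple measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_graph(addresses, threshold=0.5):
    """Connect two addresses when their similarity reaches the (assumed) threshold."""
    graph = {i: set() for i in range(len(addresses))}
    for i, j in combinations(range(len(addresses)), 2):
        if jaccard(addresses[i], addresses[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def k_core(graph, k):
    """Return the k-core as an adjacency dict: repeatedly drop nodes of degree < k."""
    adj = {u: set(vs) for u, vs in graph.items()}
    queue = [u for u, vs in adj.items() if len(vs) < k]
    while queue:
        u = queue.pop()
        if u not in adj:  # already removed via an earlier queue entry
            continue
        for v in adj.pop(u):
            adj[v].discard(u)
            if len(adj[v]) < k:
                queue.append(v)
    return adj


def components(adj):
    """Connected components of an adjacency dict, found by depth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def iterative_k_core_clusters(graph, k_max):
    """Hypothetical schedule: peel clusters from the densest core (k_max) down to k=1."""
    remaining = {u: set(vs) for u, vs in graph.items()}
    clusters = []
    for k in range(k_max, 0, -1):
        for comp in components(k_core(remaining, k)):
            clusters.append(comp)
            for u in comp:  # remove clustered nodes before trying a smaller k
                for v in remaining.pop(u):
                    remaining[v].discard(u)
    clusters.extend({u} for u in remaining)  # unmatched addresses stay singletons
    return clusters


addresses = [
    "12 Main Street Springfield",
    "12 Main St Springfield",
    "12 Main Street Springfield IL",
    "500 Ocean Avenue Miami",
    "500 Ocean Ave Miami",
    "77 Hill Road Denver",
]
clusters = iterative_k_core_clusters(build_graph(addresses), k_max=2)
print(sorted(sorted(c) for c in clusters))
```

On this toy input the three Springfield variants form a triangle and are extracted together at k=2, the two Miami variants are joined at k=1, and the Denver address remains a singleton. Requiring a minimum degree inside each cluster is what makes a k-core grouping stricter than plain connected components, which would merge any chain of pairwise-similar addresses.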