Find Duplicates
GET /api/v1/idap/businesses/duplicates
Overview
Returns clusters of businesses rows in your tenant that share the same value for a whitelisted external key (for example, two or more rows with the same google_place_id). Every cluster has count >= 2: singletons are filtered out server-side by HAVING COUNT(*) > 1.
This is the read-side primitive for dedupe workflows: surface candidates here, decide which UUID is canonical, then call Delete Business on the redundant rows.
businesses only
Today the duplicates finder is supported only on businesses. Requests against other resource types (/idap/contacts/duplicates, etc.) return 400 with a message listing the supported types. New resource types will be added as the dedupe roadmap expands.
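To make the clustering rule concrete, here is a minimal client-side sketch of the same grouping logic in Python. It is an illustration only, not the server's implementation: the row shape (an id plus the key column) and the choice to skip rows with no value are assumptions.

```python
from collections import defaultdict


def find_clusters(rows, key):
    """Group rows by `key` and keep only groups with two or more members."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get(key)
        if value is not None:  # assumption: rows without a value are not clustered
            groups[value].append(row["id"])
    clusters = [
        {"key_value": value, "count": len(ids), "resource_ids": ids}
        for value, ids in groups.items()
        if len(ids) > 1  # mirrors HAVING COUNT(*) > 1
    ]
    # Most-collided values first, ties broken by key_value ascending
    clusters.sort(key=lambda c: (-c["count"], c["key_value"]))
    return clusters


# Hypothetical rows: two share a domain, one is a singleton
rows = [
    {"id": "row-1", "domain": "example.com"},
    {"id": "row-2", "domain": "example.com"},
    {"id": "row-3", "domain": "example.org"},
]
print(find_clusters(rows, "domain"))
```

Only example.com appears in the result, because example.org has a single row and is dropped by the count filter.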
Whitelisted Keys
The key query parameter is a closed whitelist — free strings outside this list return 400 with the allowed values:
| Key | Backed by | Notes |
|---|---|---|
| google_place_id | Direct column on businesses | Google Maps CID / Place ID. UNIQUE constraint per tenant. Clusters here usually reflect upstream Maps duplicates or backfills. |
| domain | Direct column on businesses | Primary domain. Common source of duplicates when leads land via both Maps and SpiderSite crawls. |
| phone_e164 | Direct column on businesses | E.164 phone number. Useful for catching multi-listing storefronts. |
| vat | company_registry.vat_number (joined via company_registry.business_id) | VAT number. The server joins company_registry and clusters by vat_number. |
| registration_number | company_registry.registration_number (joined) | Local company-registry number. |
| lei | company_registry.lei (joined) | Legal Entity Identifier. |
| tax_id | company_registry.tax_id (joined) | Tax ID (US EIN, etc.). |
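Because the whitelist is closed, a client can reject a bad key before making the round trip. The sketch below copies the seven keys from the table; the names ALLOWED_KEYS and check_key are hypothetical helpers, not part of the API.

```python
# Keys copied from the whitelist table; update if the server-side list changes.
ALLOWED_KEYS = (
    "google_place_id",
    "domain",
    "phone_e164",
    "vat",
    "registration_number",
    "lei",
    "tax_id",
)


def check_key(key: str) -> str:
    """Return `key` unchanged if it is whitelisted, otherwise raise ValueError."""
    if key not in ALLOWED_KEYS:
        raise ValueError(
            f"Key {key!r} is not whitelisted. Allowed: {list(ALLOWED_KEYS)}"
        )
    return key


print(check_key("vat"))
```

The server remains the source of truth: a key that passes this local check can still be rejected if the whitelist changes upstream.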
Path Parameters
resource_type (string, required)
Only businesses is currently supported. Passing any other type returns 400.
Query Parameters
key (string, required)
The clustering key. Must be one of the whitelisted values above.
Example: key=google_place_id
limit (integer, default: 100)
Maximum number of clusters to return (1–500). Clusters are ordered by count DESC, key_value ASC, so the most-collided values surface first.
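As a sketch of how the two query parameters fit together, the helper below builds the request URL and checks limit against the documented 1–500 range. The function name duplicates_url is hypothetical, and raising locally on an out-of-range limit is an assumption: this page does not say whether the server rejects or clamps such values.

```python
from urllib.parse import urlencode

BASE_URL = "https://spideriq.ai/api/v1/idap/businesses/duplicates"


def duplicates_url(key: str, limit: int = 100) -> str:
    """Build the duplicates URL, enforcing the documented limit range locally."""
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    return f"{BASE_URL}?{urlencode({'key': key, 'limit': limit})}"


print(duplicates_url("google_place_id", limit=50))
# https://spideriq.ai/api/v1/idap/businesses/duplicates?key=google_place_id&limit=50
```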
Response
resource_type (string)
The resource type the duplicates were grouped on (always businesses today).
key (string)
The whitelisted key used for clustering (echoed back).
clusters (array)
Duplicate clusters, ordered by count DESC, key_value ASC. An empty array means no duplicates exist for this key on this tenant. Each entry has:
- key_value: the shared external value (e.g. the Google Place ID, domain, or VAT).
- count: number of resources sharing this value (always >= 2).
- resource_ids: UUIDs of the rows in this cluster (from norm_cli_<your_client>.businesses).
total_clusters (integer)
Number of clusters returned (== len(clusters), capped by the request's limit).
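The field descriptions above imply two invariants a client can check cheaply: total_clusters equals the length of clusters, and every count is at least 2. The following is a minimal parsing sketch; the Cluster dataclass and parse_clusters are hypothetical names, and the sample payload is trimmed from the response shape documented on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    key_value: str
    count: int
    resource_ids: list[str]


def parse_clusters(payload: dict) -> list[Cluster]:
    """Convert a duplicates response into Cluster objects and check invariants."""
    clusters = [
        Cluster(
            key_value=entry["key_value"],
            count=entry["count"],
            resource_ids=list(entry["resource_ids"]),
        )
        for entry in payload["clusters"]
    ]
    if payload["total_clusters"] != len(clusters):
        raise ValueError("total_clusters does not match len(clusters)")
    for cluster in clusters:
        if cluster.count < 2:
            raise ValueError(f"cluster {cluster.key_value!r} has count < 2")
    return clusters


payload = {
    "resource_type": "businesses",
    "key": "google_place_id",
    "clusters": [
        {
            "key_value": "ChIJTbk_T0OABDsRAA0AAAAAAAA",
            "count": 2,
            "resource_ids": [
                "d4e5f6a7-b8c9-0123-defa-234567890123",
                "e5f6a7b8-c9d0-1234-efab-345678901234",
            ],
        }
    ],
    "total_clusters": 1,
}
for cluster in parse_clusters(payload):
    print(cluster.count, cluster.key_value)
```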
Example Request
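cURL
# Cluster businesses that share a google_place_id
curl "https://spideriq.ai/api/v1/idap/businesses/duplicates?key=google_place_id&limit=50" \
  -H "Authorization: Bearer $TOKEN"
Python
import os

import httpx

# Read the API token from the same TOKEN variable the cURL example uses
token = os.environ["TOKEN"]

# Cluster businesses that share a domain
response = httpx.get(
    "https://spideriq.ai/api/v1/idap/businesses/duplicates",
    params={"key": "domain", "limit": 100},
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()  # surface 4xx/5xx responses as exceptions
data = response.json()

# Print each cluster followed by the UUIDs that share the value
for cluster in data["clusters"]:
    print(f"{cluster['count']}x {cluster['key_value']}")
    for rid in cluster["resource_ids"]:
        print(f"  {rid}")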
Response Example
Clusters found
{
  "resource_type": "businesses",
  "key": "google_place_id",
  "clusters": [
    {
      "key_value": "0x47e66fdad6f1cc73:0x341211b3fccd79e1",
      "count": 3,
      "resource_ids": [
        "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "b2c3d4e5-f6a7-8901-bcde-f12345678901",
        "c3d4e5f6-a7b8-9012-cdef-123456789012"
      ]
    },
    {
      "key_value": "ChIJTbk_T0OABDsRAA0AAAAAAAA",
      "count": 2,
      "resource_ids": [
        "d4e5f6a7-b8c9-0123-defa-234567890123",
        "e5f6a7b8-c9d0-1234-efab-345678901234"
      ]
    }
  ],
  "total_clusters": 2
}
No duplicates
{
  "resource_type": "businesses",
  "key": "lei",
  "clusters": [],
  "total_clusters": 0
}
400 (key not whitelisted)
{
  "detail": "Key 'name' is not valid for /businesses/duplicates. Allowed: ['google_place_id', 'domain', 'phone_e164', 'vat', 'registration_number', 'lei', 'tax_id']"
}
400 (unsupported resource type)
{
  "detail": "Duplicates finder is not yet supported on resource type 'contacts'. Currently supported: ['businesses']"
}
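Both 400 bodies above carry a single detail string, so a client can surface it directly instead of a generic error. The sketch below is one way to do that. DuplicatesRequestError and check_response are hypothetical names, and the handling of statuses other than 400 is an assumption, since this page documents only the 400 cases.

```python
class DuplicatesRequestError(Exception):
    """Hypothetical exception carrying the server's detail message."""


def check_response(status_code: int, body: dict) -> dict:
    """Return the body on success, or raise with the server's detail on failure."""
    if status_code == 400:
        # Documented case: invalid key or unsupported resource type
        raise DuplicatesRequestError(body.get("detail", "Bad request"))
    if status_code >= 400:
        # Assumption: other error statuses are reported generically
        raise DuplicatesRequestError(f"Unexpected status {status_code}")
    return body


try:
    check_response(
        400,
        {"detail": "Duplicates finder is not yet supported on resource type "
                   "'contacts'. Currently supported: ['businesses']"},
    )
except DuplicatesRequestError as exc:
    print(f"Request rejected: {exc}")
```

With httpx, the two arguments would come from response.status_code and response.json().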
Dedupe Workflow
The full read → choose → delete loop:
# 1. Find clusters
curl "https://spideriq.ai/api/v1/idap/businesses/duplicates?key=domain" \
-H "Authorization: Bearer $TOKEN"
# 2. For each cluster, decide which UUID is canonical, then delete the rest.
# The DELETE endpoint cascades to related rows + writes an audit row.
curl -X DELETE "https://spideriq.ai/api/v1/idap/businesses/$DUPLICATE_UUID" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"reason": "duplicate of '"$CANONICAL_UUID"'"}'
See Delete Business for the full cascade + audit details and the 409 booking_conflict envelope.
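The "choose" step of the loop above can be sketched as a pure function that turns clusters into a deletion plan before any DELETE call is made. Everything here is illustrative: plan_deletions is a hypothetical helper, and the default policy of keeping the lexicographically smallest UUID is a placeholder assumption. Real workflows would usually pick the canonical row by data completeness or age.

```python
def plan_deletions(clusters, choose_canonical=min):
    """Return one planned deletion per redundant row in each cluster.

    `choose_canonical` receives the cluster's resource_ids and returns the
    UUID to keep. The default (min) is a placeholder policy only.
    """
    plan = []
    for cluster in clusters:
        canonical = choose_canonical(cluster["resource_ids"])
        for rid in cluster["resource_ids"]:
            if rid != canonical:
                plan.append(
                    {"delete": rid, "reason": f"duplicate of {canonical}"}
                )
    return plan


clusters = [
    {
        "key_value": "ChIJTbk_T0OABDsRAA0AAAAAAAA",
        "count": 2,
        "resource_ids": [
            "d4e5f6a7-b8c9-0123-defa-234567890123",
            "e5f6a7b8-c9d0-1234-efab-345678901234",
        ],
    }
]
for step in plan_deletions(clusters):
    print(step["delete"], "->", step["reason"])
```

Each planned entry maps onto one DELETE request, with reason sent as the JSON body shown in step 2. Building the plan first makes it easy to review or log before anything is removed.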