
Find Duplicates

GET /api/v1/idap/businesses/duplicates

Overview

Return clusters of businesses rows in your tenant that share the same value for a whitelisted external key (e.g. two or more rows with the same google_place_id). Each cluster has count >= 2 — singletons are filtered out by HAVING COUNT(*) > 1 server-side.

This is the read-side primitive for dedupe workflows: surface candidates here, decide which UUID is canonical, then call Delete Business on the redundant rows.
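To make the clustering rule concrete, here is a minimal in-memory sketch of the grouping described above. It is an illustration only, not the server's implementation: the `find_clusters` helper and the sample rows are hypothetical, and skipping rows with no value for the key is an assumption. The sort mirrors the `count DESC, key_value ASC` ordering documented for the response.

```python
from collections import defaultdict


def find_clusters(rows, key):
    """Group rows by row[key] and keep only groups with two or more members."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get(key)
        if value is None:  # assumption: rows without a value are not clustered
            continue
        groups[value].append(row["id"])

    clusters = [
        {"key_value": value, "count": len(ids), "resource_ids": ids}
        for value, ids in groups.items()
        if len(ids) > 1  # mirrors HAVING COUNT(*) > 1
    ]
    # Most-collided values first, ties broken by key_value ascending.
    clusters.sort(key=lambda c: (-c["count"], c["key_value"]))
    return clusters


# Hypothetical sample rows; real rows live in your tenant's businesses table.
rows = [
    {"id": "u1", "domain": "example.com"},
    {"id": "u2", "domain": "example.com"},
    {"id": "u3", "domain": "example.org"},
    {"id": "u4", "domain": None},
]
clusters = find_clusters(rows, "domain")
```

Here `example.org` is a singleton and is dropped, so only the `example.com` cluster survives.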

Currently businesses only

Today the duplicates finder is supported only on businesses. Requests against other resource types (/idap/contacts/duplicates, etc.) return 400 with a message listing the supported types. New resource types will be added as the dedupe roadmap expands.

Whitelisted Keys

The key query parameter is a closed whitelist. Any value outside this list returns 400 along with the allowed values:

| Key | Backed by | Notes |
| --- | --- | --- |
| google_place_id | Direct column on businesses | Google Maps CID / Place ID. UNIQUE constraint per tenant; clusters here usually reflect upstream Maps duplicates or backfills. |
| domain | Direct column on businesses | Primary domain. Common source of duplicates when leads land via both Maps and SpiderSite crawls. |
| phone_e164 | Direct column on businesses | E.164 phone number. Useful for catching multi-listing storefronts. |
| vat | company_registry.vat_number (joined via company_registry.business_id) | VAT number. The server joins company_registry and clusters by vat_number. |
| registration_number | company_registry.registration_number (joined) | Local company-registry number. |
| lei | company_registry.lei (joined) | Legal Entity Identifier. |
| tax_id | company_registry.tax_id (joined) | Tax ID (US EIN, etc.). |

Path Parameters

resource_type (string, required)

Only businesses is currently supported. Passing any other type returns 400.

Query Parameters

key (string, required)

The clustering key. Must be one of the whitelisted values above.

Example: key=google_place_id

limit (integer, default: 100)

Maximum number of clusters to return (1–500). Clusters are ordered by count DESC, key_value ASC so the most-collided values surface first.
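A client can reject bad parameters before making the call. The sketch below builds the request URL with the standard library and checks `key` and `limit` against the documented constraints. `build_duplicates_url` and the constant names are hypothetical helpers, not part of any official SDK; the server remains the source of truth for validation.

```python
from urllib.parse import urlencode

BASE_URL = "https://spideriq.ai/api/v1/idap"

# Whitelisted keys, copied from the table above.
ALLOWED_KEYS = (
    "google_place_id",
    "domain",
    "phone_e164",
    "vat",
    "registration_number",
    "lei",
    "tax_id",
)


def build_duplicates_url(key, limit=100, resource_type="businesses"):
    """Return the GET URL for the duplicates finder, validating inputs locally."""
    if key not in ALLOWED_KEYS:
        raise ValueError(f"key must be one of {list(ALLOWED_KEYS)}, got {key!r}")
    if not 1 <= limit <= 500:
        raise ValueError(f"limit must be between 1 and 500, got {limit}")
    query = urlencode({"key": key, "limit": limit})
    return f"{BASE_URL}/{resource_type}/duplicates?{query}"


url = build_duplicates_url("google_place_id", limit=50)
```

Failing fast on the client saves a round trip, but the 400 responses still need handling because the whitelist may change server-side.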

Response

resource_type (string)

The resource type the duplicates were grouped on (always businesses today).

key (string)

The whitelisted key used for clustering (echoed back).

clusters (array)

Duplicate clusters, ordered by count DESC, key_value ASC. Empty array means no duplicates exist for this key on this tenant. Each entry has:

  • key_value: the shared external value (e.g. the Google Place ID, domain, or VAT).
  • count: number of resources sharing this value (always >= 2).
  • resource_ids: UUIDs of the rows in this cluster (from norm_cli_<your_client>.businesses).

total_clusters (integer)

Number of clusters returned (== len(clusters), capped by the request's limit).
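The response shape can be modelled with a pair of dataclasses. The following sketch parses a response body and checks the two invariants stated above (`count >= 2` and `total_clusters == len(clusters)`). The class and function names are hypothetical, and the sample body uses placeholder IDs.

```python
import json
from dataclasses import dataclass


@dataclass
class Cluster:
    key_value: str
    count: int
    resource_ids: list


@dataclass
class DuplicatesResponse:
    resource_type: str
    key: str
    clusters: list
    total_clusters: int


def parse_duplicates(body):
    """Parse a duplicates response body and verify the documented invariants."""
    data = json.loads(body)
    clusters = [
        Cluster(c["key_value"], c["count"], list(c["resource_ids"]))
        for c in data["clusters"]
    ]
    for cluster in clusters:
        if cluster.count < 2:
            raise ValueError(f"unexpected singleton cluster: {cluster.key_value}")
    if data["total_clusters"] != len(clusters):
        raise ValueError("total_clusters does not match len(clusters)")
    return DuplicatesResponse(
        data["resource_type"], data["key"], clusters, data["total_clusters"]
    )


# Placeholder body in the documented shape.
sample = json.dumps({
    "resource_type": "businesses",
    "key": "domain",
    "clusters": [
        {"key_value": "example.com", "count": 2, "resource_ids": ["id-1", "id-2"]}
    ],
    "total_clusters": 1,
})
parsed = parse_duplicates(sample)
```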

Example Request

# Cluster businesses that share a google_place_id
curl "https://spideriq.ai/api/v1/idap/businesses/duplicates?key=google_place_id&limit=50" \
  -H "Authorization: Bearer $TOKEN"

Response Example

Clusters found:

{
  "resource_type": "businesses",
  "key": "google_place_id",
  "clusters": [
    {
      "key_value": "0x47e66fdad6f1cc73:0x341211b3fccd79e1",
      "count": 3,
      "resource_ids": [
        "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "b2c3d4e5-f6a7-8901-bcde-f12345678901",
        "c3d4e5f6-a7b8-9012-cdef-123456789012"
      ]
    },
    {
      "key_value": "ChIJTbk_T0OABDsRAA0AAAAAAAA",
      "count": 2,
      "resource_ids": [
        "d4e5f6a7-b8c9-0123-defa-234567890123",
        "e5f6a7b8-c9d0-1234-efab-345678901234"
      ]
    }
  ],
  "total_clusters": 2
}

No duplicates for the key:

{
  "resource_type": "businesses",
  "key": "lei",
  "clusters": [],
  "total_clusters": 0
}

Invalid key (400):

{
  "detail": "Key 'name' is not valid for /businesses/duplicates. Allowed: ['google_place_id', 'domain', 'phone_e164', 'vat', 'registration_number', 'lei', 'tax_id']"
}

Unsupported resource type (400):

{
  "detail": "Duplicates finder is not yet supported on resource type 'contacts'. Currently supported: ['businesses']"
}
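Both error bodies above carry a single `detail` string, so a client can surface it directly. This is a minimal sketch, assuming every error response follows that shape; `DuplicatesRequestError` and `check_response` are hypothetical names, and the status code and body are passed in so the helper works with any HTTP library.

```python
import json


class DuplicatesRequestError(Exception):
    """Raised when the duplicates endpoint answers with an error status."""

    def __init__(self, status, detail):
        super().__init__(f"{status}: {detail}")
        self.status = status
        self.detail = detail


def check_response(status, body):
    """Return the parsed JSON body, or raise with the server's detail message."""
    if status >= 400:
        detail = body  # fall back to the raw body if it is not the expected JSON
        try:
            payload = json.loads(body)
            if isinstance(payload, dict) and "detail" in payload:
                detail = payload["detail"]
        except ValueError:
            pass
        raise DuplicatesRequestError(status, detail)
    return json.loads(body)


ok = check_response(200, '{"resource_type": "businesses", "key": "lei", "clusters": [], "total_clusters": 0}')
```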

Dedupe Workflow

The full read → choose → delete loop:

# 1. Find clusters
curl "https://spideriq.ai/api/v1/idap/businesses/duplicates?key=domain" \
  -H "Authorization: Bearer $TOKEN"

# 2. For each cluster, decide which UUID is canonical, then delete the rest.
# The DELETE endpoint cascades to related rows + writes an audit row.
curl -X DELETE "https://spideriq.ai/api/v1/idap/businesses/$DUPLICATE_UUID" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"reason": "duplicate of '"$CANONICAL_UUID"'"}'

See Delete Business for the full cascade + audit details and the 409 booking_conflict envelope.
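The "choose" step of the loop can be scripted. The sketch below turns a list of clusters into a deletion plan and builds (but does not send) the matching DELETE requests. All helper names are hypothetical, and the default rule of keeping the first UUID in each cluster is a placeholder assumption, not a recommendation: supply your own `choose_canonical` that reflects which row you trust.

```python
import json
from urllib.request import Request

BASE_URL = "https://spideriq.ai/api/v1/idap"


def plan_deletions(clusters, choose_canonical=lambda ids: ids[0]):
    """Return one planned delete per redundant row, keeping one UUID per cluster."""
    plan = []
    for cluster in clusters:
        ids = cluster["resource_ids"]
        canonical = choose_canonical(ids)
        for resource_id in ids:
            if resource_id != canonical:
                plan.append({
                    "delete": resource_id,
                    "body": {"reason": f"duplicate of {canonical}"},
                })
    return plan


def build_delete_request(step, token):
    """Build the DELETE request for one plan step; the caller decides when to send it."""
    return Request(
        f"{BASE_URL}/businesses/{step['delete']}",
        data=json.dumps(step["body"]).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="DELETE",
    )


# Placeholder cluster in the documented response shape.
clusters = [
    {"key_value": "example.com", "count": 3, "resource_ids": ["id-1", "id-2", "id-3"]},
]
plan = plan_deletions(clusters)
requests_to_send = [build_delete_request(step, "TOKEN") for step in plan]
```

Keeping planning separate from sending makes it easy to review the plan, or log it, before any row is deleted.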