Deduplication with SpiderFuzzer
Overview
SpiderFuzzer is SpiderIQ's built-in deduplication system that prevents duplicate records across your scraping campaigns. Available since v2.18.0, SpiderFuzzer provides:
- Per-client isolation - Your data is stored in a separate PostgreSQL schema
- Automatic deduplication - Enable via payload flags on any job type
- Standalone API - Check and manage records directly via dedicated endpoints
- Multi-field matching - Match on email, google_place_id, linkedin_url, phone, or domain
Two Ways to Use SpiderFuzzer
1. Automatic Mode (In Jobs)
Add `fuzziq_enabled: true` to any job payload and each record is automatically checked against your canonical database:
```bash
curl -X POST https://spideriq.ai/api/v1/jobs/spiderMaps/submit \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "search_query": "restaurants in Paris",
      "max_results": 20,
      "fuzziq_enabled": true
    }
  }'
```
Results include a `fuzziq_unique` flag on each record:
```json
{
  "data": {
    "businesses": [
      {
        "name": "Le Petit Bistro",
        "place_id": "ChIJ123...",
        "fuzziq_unique": true
      },
      {
        "name": "Cafe de Flore",
        "place_id": "ChIJ456...",
        "fuzziq_unique": false
      }
    ]
  }
}
```
2. Standalone API
Use SpiderFuzzer endpoints directly for custom workflows:
| Endpoint | Use Case |
|---|---|
| `POST /fuzziq/check` | Check a single record |
| `POST /fuzziq/check-batch` | Check up to 100 records |
| `POST /fuzziq/canonical/import` | Bulk import up to 1000 records |
| `GET /fuzziq/canonical` | List your canonical records |
| `GET /fuzziq/canonical/stats` | Get database statistics |
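The single-record endpoint is easy to wrap in a small helper. This is a sketch rather than official client code: the request body fields (`record`, `record_type`) mirror the batch examples later on this page, the response shape is an assumption to verify against the API reference, and `SPIDERIQ_TOKEN` is a hypothetical environment variable.

```python
import os

import requests

API_BASE = "https://spideriq.ai/api/v1"
HEADERS = {
    "Authorization": f"Bearer {os.environ.get('SPIDERIQ_TOKEN', '')}",
    "Content-Type": "application/json",
}


def check_record(record: dict, record_type: str = "business") -> dict:
    """Check one record against your canonical database."""
    resp = requests.post(
        f"{API_BASE}/fuzziq/check",
        headers=HEADERS,
        json={"record": record, "record_type": record_type},
    )
    resp.raise_for_status()
    return resp.json()


# Example (requires a valid token):
# result = check_record({"email": "jane@example.com"}, record_type="contact")
```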
Record Types
SpiderFuzzer supports four record types, each with different matching priorities:
| Type | Description | Key fields | Best for |
|---|---|---|---|
| Business | Google Maps businesses | `google_place_id`, `company_name`, `phone` | SpiderMaps results |
| Contact | People/contacts | `email`, `full_name`, `linkedin_url` | SpiderSite extracted contacts |
| Email | Email-only records | `email` | SpiderVerify results |
| Profile | LinkedIn profiles | `linkedin_url`, `full_name` | SpiderPeople results |
Match Types
SpiderFuzzer checks fields in priority order (first match wins):
| Priority | Field | Description |
|---|---|---|
| 1 | google_place_id | Exact match (best for businesses) |
| 2 | email | Exact match (normalized, lowercase) |
| 3 | linkedin_url | Exact match (normalized) |
| 4 | phone | Exact match (normalized, digits only) |
| 5 | company_domain | Exact match |
| 6 | exact_hash | SHA256 hash of all normalized fields |
All matches return `confidence: 1.0` (exact matching only; fuzzy matching is not yet available).
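The first-match-wins priority order can be sketched in Python. This is an illustration of the ordering, not SpiderFuzzer's actual implementation; normalization details beyond what the table states (for example, trailing-slash handling for LinkedIn URLs and the exact input format of the SHA-256 hash) are assumptions.

```python
import hashlib
import re

# Priority order mirrors the table above: the first populated field wins.
MATCH_FIELDS = ["google_place_id", "email", "linkedin_url", "phone", "company_domain"]


def normalize(field: str, value: str) -> str:
    value = value.strip()
    if field == "email":
        return value.lower()  # normalized, lowercase (per the table)
    if field == "phone":
        return re.sub(r"\D", "", value)  # digits only (per the table)
    if field == "linkedin_url":
        return value.lower().rstrip("/")  # assumed normalization
    return value


def match_key(record: dict) -> tuple:
    """Return (field, normalized_value) for the highest-priority populated
    field; fall back to a SHA-256 hash over all normalized fields."""
    for field in MATCH_FIELDS:
        if record.get(field):
            return field, normalize(field, record[field])
    blob = "|".join(f"{k}={normalize(k, str(v))}" for k, v in sorted(record.items()))
    return "exact_hash", hashlib.sha256(blob.encode()).hexdigest()
```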
Common Workflows
Deduplicate Multi-Location Campaigns
When scraping the same business type across multiple cities, the same chains appear in each location:
```python
import requests

TOKEN = "your_token"
headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

# Run the campaign across 3 cities
cities = ["Paris", "Lyon", "Marseille"]
all_unique_businesses = []

for city in cities:
    # Submit a SpiderMaps job with SpiderFuzzer enabled
    job = requests.post(
        "https://spideriq.ai/api/v1/jobs/spiderMaps/submit",
        headers=headers,
        json={
            "payload": {
                "search_query": f"restaurants in {city}",
                "max_results": 50,
                "fuzziq_enabled": True,
                "fuzziq_unique_only": True,  # Only return unique records
            }
        },
    ).json()

    # Wait for results (poll_for_results is your own polling helper)
    results = poll_for_results(job["job_id"])

    # All returned businesses are guaranteed unique
    all_unique_businesses.extend(results["data"]["businesses"])
    print(f"{city}: {len(results['data']['businesses'])} unique businesses")

print(f"Total unique across all cities: {len(all_unique_businesses)}")
```
Pre-seed from CRM
Before running campaigns, import your existing customers so they're automatically excluded:
```python
# Import existing customers
customers = [
    {"company_domain": "existingcustomer1.com"},
    {"company_domain": "existingcustomer2.com"},
    # ... more customers
]

requests.post(
    "https://spideriq.ai/api/v1/fuzziq/canonical/import",
    headers=headers,
    json={
        "records": customers,
        "record_type": "business",
        "skip_duplicates": True,
    },
)

# Now when you run SpiderMaps, existing customers are marked as duplicates
```
Check Before Expensive Operations
Use SpiderFuzzer batch check to filter records before SpiderSite or SpiderVerify:
```python
# Got 100 businesses from SpiderMaps
businesses = spidermaps_result["data"]["businesses"]

# Check with SpiderFuzzer before running SpiderSite on each
dedup_result = requests.post(
    "https://spideriq.ai/api/v1/fuzziq/check-batch",
    headers=headers,
    json={
        "records": [
            {"google_place_id": b["place_id"], "company_name": b["name"]}
            for b in businesses
        ],
        "record_type": "business",
        "add_to_canonical": True,  # Add unique ones automatically
    },
).json()

# Only run SpiderSite on unique businesses (saves money!)
unique_businesses = dedup_result["unique"]
print(
    f"Running SpiderSite on {len(unique_businesses)} unique businesses "
    f"(skipped {dedup_result['stats']['duplicate_count']} duplicates)"
)
```
Job Payload Options
Add these to any job payload:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `fuzziq_enabled` | boolean | client default | Enable SpiderFuzzer for this job |
| `fuzziq_unique_only` | boolean | `false` | Only return unique records |
Example:
```json
{
  "payload": {
    "search_query": "coffee shops in Berlin",
    "fuzziq_enabled": true,
    "fuzziq_unique_only": true
  }
}
```
Data Isolation
Each client has a completely isolated PostgreSQL schema:
```text
SpiderFuzzer Database
├── Schema: client_abc123 (your data)
│   └── canonical_records: 15,420 rows
├── Schema: client_def456 (another client)
│   └── canonical_records: 8,750 rows
└── Schema: client_ghi789 (another client)
    └── canonical_records: 42,100 rows
```
Your data is never mixed with other clients.
API Reference
| Operation | Description |
|---|---|
| Check Single | Check one record for duplicates |
| Check Batch | Check up to 100 records |
| List Records | Browse your canonical database |
| Add Record | Manually add a record |
| Bulk Import | Import up to 1000 records |
| Statistics | View database stats |
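For example, a quick way to see what is in your schema is the stats endpoint. The sketch below assumes a hypothetical `SPIDERIQ_TOKEN` environment variable, and the response field names are not documented on this page, so treat the returned dict as opaque until you check the endpoint reference.

```python
import os

import requests

API_BASE = "https://spideriq.ai/api/v1"


def canonical_stats() -> dict:
    """Fetch statistics for your canonical database."""
    resp = requests.get(
        f"{API_BASE}/fuzziq/canonical/stats",
        headers={"Authorization": f"Bearer {os.environ.get('SPIDERIQ_TOKEN', '')}"},
    )
    resp.raise_for_status()
    return resp.json()


# stats = canonical_stats()  # requires a valid token
```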
Best Practices
Seed your database before campaigns
Import existing customers, competitors, or blocked domains before running campaigns. This prevents wasting resources on records you already have.
Use batch check for efficiency
Instead of checking records one by one, use check-batch with up to 100 records per request. For imports, use canonical/import with up to 1000 records.
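A campaign can easily produce more than 100 records, so a small generator keeps each `check-batch` request under the limit. The helper is generic Python; the commented request mirrors the batch example earlier on this page.

```python
def chunked(records, size):
    """Yield successive slices of at most `size` records."""
    for i in range(0, len(records), size):
        yield records[i:i + size]


# Sketch: check 250 records in three requests of at most 100 each.
# for batch in chunked(all_records, 100):
#     requests.post(
#         "https://spideriq.ai/api/v1/fuzziq/check-batch",
#         headers=headers,
#         json={"records": batch, "record_type": "business"},
#     )
```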
Enable fuzziq_unique_only for clean results
When you only want new records, set fuzziq_unique_only: true in your job payload. This filters duplicates server-side so you only process unique data.
Use idempotency keys for retries
When using check-batch, include an idempotency_key to safely retry failed requests without creating duplicate canonical entries.
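One way to get a stable key is to derive it from the batch contents, so retrying the exact same batch automatically reuses the same key. This is one possible scheme, not something the API requires; a random UUID stored alongside the pending request works equally well.

```python
import hashlib
import json


def batch_idempotency_key(records: list) -> str:
    """Derive a stable idempotency key from the batch contents
    (hypothetical scheme: same records, same key)."""
    blob = json.dumps(records, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:32]


# payload = {
#     "records": batch,
#     "record_type": "business",
#     "idempotency_key": batch_idempotency_key(batch),
# }
```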
SpiderFuzzer is automatically enabled for all clients. Contact support if you need to adjust your settings or have questions about your canonical database.