Overview

SpiderFuzzer is SpiderIQ’s built-in deduplication system that prevents duplicate records across your scraping campaigns. Available since v2.18.0, SpiderFuzzer provides:
  • Per-client isolation - Your data is stored in a separate PostgreSQL schema
  • Automatic deduplication - Enable via payload flags on any job type
  • Standalone API - Check and manage records directly via dedicated endpoints
  • Multi-field matching - Match on email, google_place_id, linkedin_url, phone, or domain

Two Ways to Use SpiderFuzzer

1. Automatic Mode (In Jobs)

Add fuzziq_enabled: true to any job payload, and each record is automatically checked against your canonical database:
curl -X POST https://spideriq.ai/api/v1/jobs/spiderMaps/submit \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "search_query": "restaurants in Paris",
      "max_results": 20,
      "fuzziq_enabled": true
    }
  }'
Results include a fuzziq_unique flag on each record:
{
  "data": {
    "businesses": [
      {
        "name": "Le Petit Bistro",
        "place_id": "ChIJ123...",
        "fuzziq_unique": true
      },
      {
        "name": "Cafe de Flore",
        "place_id": "ChIJ456...",
        "fuzziq_unique": false
      }
    ]
  }
}
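
If you don't set fuzziq_unique_only (covered below), you can filter client-side instead. A minimal sketch, assuming result holds the parsed response shown above:
# Keep only records SpiderFuzzer marked as new
unique = [b for b in result["data"]["businesses"] if b["fuzziq_unique"]]
print(f"{len(unique)} of {len(result['data']['businesses'])} records are unique")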

2. Standalone API

Use SpiderFuzzer endpoints directly for custom workflows:
Endpoint                         Use Case
POST /fuzziq/check               Check single record
POST /fuzziq/check-batch         Check up to 100 records
POST /fuzziq/canonical/import    Bulk import up to 1000 records
GET /fuzziq/canonical            List your canonical records
GET /fuzziq/canonical/stats      Get database statistics
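
For example, a single-record check from Python. The request body shape here (a record object plus record_type) is an assumption inferred from the check-batch examples later on this page:
import requests

TOKEN = "your_token"
headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

# Check one contact; body shape assumed to mirror check-batch
resp = requests.post(
    "https://spideriq.ai/api/v1/fuzziq/check",
    headers=headers,
    json={"record": {"email": "jane@example.com"}, "record_type": "contact"},
).json()
print(resp)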

Record Types

SpiderFuzzer supports four record types, each with different matching priorities:

Business

Google Maps businesses
Key fields: google_place_id, company_name, phone
Best for: SpiderMaps results

Contact

People/contacts
Key fields: email, full_name, linkedin_url
Best for: SpiderSite extracted contacts

Email

Email-only records
Key fields: email
Best for: SpiderVerify results

Profile

LinkedIn profiles
Key fields: linkedin_url, full_name
Best for: SpiderPeople results
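
Putting the key fields together, illustrative payloads for each record type might look like this (all values are placeholders):
# Illustrative records built from the key fields listed above
business = {"google_place_id": "ChIJ123...", "company_name": "Le Petit Bistro", "phone": "+33142000000"}
contact = {"email": "jane@example.com", "full_name": "Jane Doe", "linkedin_url": "https://linkedin.com/in/janedoe"}
email = {"email": "info@example.com"}
profile = {"linkedin_url": "https://linkedin.com/in/janedoe", "full_name": "Jane Doe"}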

Match Types

SpiderFuzzer checks fields in priority order (first match wins):
Priority  Field            Description
1         google_place_id  Exact match (best for businesses)
2         email            Exact match (normalized, lowercase)
3         linkedin_url     Exact match (normalized)
4         phone            Exact match (normalized, digits only)
5         company_domain   Exact match
6         exact_hash       SHA256 hash of all normalized fields
All matches return confidence: 1.0 (exact matching only, no fuzzy matching yet).
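
To reason locally about which field a record will match on, here is a sketch of the priority logic. The normalization rules follow the table above, but the exact_hash serialization is an assumption; actual matching always happens server-side:
import hashlib
import re

# Priority order from the table above; the first populated field wins.
PRIORITY = ("google_place_id", "email", "linkedin_url", "phone", "company_domain")

def normalize(field, value):
    if field == "email":
        return value.strip().lower()     # normalized, lowercase
    if field == "phone":
        return re.sub(r"\D", "", value)  # normalized, digits only
    return value.strip()

def match_field(record):
    """Return the field SpiderFuzzer would match on for this record."""
    for field in PRIORITY:
        if record.get(field):
            return field
    return "exact_hash"  # fallback: SHA256 hash of all normalized fields

def exact_hash(record):
    # The serialization scheme here is illustrative, not the server's.
    canon = "|".join(f"{k}={normalize(k, v)}" for k, v in sorted(record.items()) if v)
    return hashlib.sha256(canon.encode()).hexdigest()

print(match_field({"phone": "+33 (0)1 23 45 67 89", "company_name": "Le Petit Bistro"}))  # phone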

Common Workflows

Deduplicate Multi-Location Campaigns

When scraping the same business type across multiple cities, the same chains appear in each location:
import requests

TOKEN = "your_token"
headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

# Run campaign across 3 cities
cities = ["Paris", "Lyon", "Marseille"]
all_unique_businesses = []

for city in cities:
    # Submit SpiderMaps job with SpiderFuzzer enabled
    job = requests.post(
        "https://spideriq.ai/api/v1/jobs/spiderMaps/submit",
        headers=headers,
        json={
            "payload": {
                "search_query": f"restaurants in {city}",
                "max_results": 50,
                "fuzziq_enabled": True,
                "fuzziq_unique_only": True  # Only return unique records
            }
        }
    ).json()

    # Wait for results (poll_for_results is sketched after this example)
    results = poll_for_results(job["job_id"])

    # All returned businesses are guaranteed unique
    all_unique_businesses.extend(results["data"]["businesses"])
    print(f"{city}: {len(results['data']['businesses'])} unique businesses")

print(f"Total unique across all cities: {len(all_unique_businesses)}")

Pre-seed from CRM

Before running campaigns, import your existing customers so they're automatically flagged as duplicates (or filtered out entirely with fuzziq_unique_only):
# Import existing customers
customers = [
    {"company_domain": "existingcustomer1.com"},
    {"company_domain": "existingcustomer2.com"},
    # ... more customers
]

requests.post(
    "https://spideriq.ai/api/v1/fuzziq/canonical/import",
    headers=headers,
    json={
        "records": customers,
        "record_type": "business",
        "skip_duplicates": True
    }
)

# Now when you run SpiderMaps, existing customers are marked as duplicates
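
To confirm the seed landed, you can query the stats endpoint listed earlier. The response shape is not documented here, so treat the keys as whatever your account returns:
# Verify the canonical database after seeding
stats = requests.get(
    "https://spideriq.ai/api/v1/fuzziq/canonical/stats",
    headers=headers,
).json()
print(stats)  # e.g. record counts for your schema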

Check Before Expensive Operations

Use SpiderFuzzer batch check to filter records before SpiderSite or SpiderVerify:
# Got 100 businesses from SpiderMaps
businesses = spidermaps_result["data"]["businesses"]

# Check with SpiderFuzzer before running SpiderSite on each
dedup_result = requests.post(
    "https://spideriq.ai/api/v1/fuzziq/check-batch",
    headers=headers,
    json={
        "records": [
            {"google_place_id": b["place_id"], "company_name": b["name"]}
            for b in businesses
        ],
        "record_type": "business",
        "add_to_canonical": True  # Add unique ones automatically
    }
).json()

# Only run SpiderSite on unique businesses (saves money!)
unique_businesses = dedup_result["unique"]
print(f"Running SpiderSite on {len(unique_businesses)} unique businesses (skipped {dedup_result['stats']['duplicate_count']} duplicates)")

Job Payload Options

Add these to any job payload:
Parameter           Type     Default         Description
fuzziq_enabled      boolean  client default  Enable SpiderFuzzer for this job
fuzziq_unique_only  boolean  false           Only return unique records
Example:
{
  "payload": {
    "search_query": "coffee shops in Berlin",
    "fuzziq_enabled": true,
    "fuzziq_unique_only": true
  }
}

Data Isolation

Each client has a completely isolated PostgreSQL schema:
SpiderFuzzer Database
├── Schema: client_abc123 (your data)
│   └── canonical_records: 15,420 rows
├── Schema: client_def456 (another client)
│   └── canonical_records: 8,750 rows
└── Schema: client_ghi789 (another client)
    └── canonical_records: 42,100 rows
Your data is never mixed with other clients.

Best Practices

Import existing customers, competitors, or blocked domains before running campaigns. This prevents wasting resources on records you already have.
Instead of checking records one by one, use check-batch with up to 100 records per request. For imports, use canonical/import with up to 1000 records.
When you only want new records, set fuzziq_unique_only: true in your job payload. This filters duplicates server-side so you only process unique data.
When using check-batch, include an idempotency_key to safely retry failed requests without creating duplicate canonical entries.
SpiderFuzzer is automatically enabled for all clients. Contact support if you need to adjust your settings or have questions about your canonical database.
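
For the retry tip above, a sketch of an idempotent batch check; it assumes check-batch accepts a top-level idempotency_key field as described:
import uuid

import requests

headers = {"Authorization": "Bearer your_token", "Content-Type": "application/json"}

idempotency_key = str(uuid.uuid4())  # generate once, reuse on every retry

resp = requests.post(
    "https://spideriq.ai/api/v1/fuzziq/check-batch",
    headers=headers,
    json={
        "records": [{"email": "jane@example.com"}],
        "record_type": "contact",
        "add_to_canonical": True,
        "idempotency_key": idempotency_key,  # assumed field name, per the tip above
    },
)
print(resp.status_code)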