Deduplication with SpiderFuzzer
Overview
SpiderFuzzer is SpiderIQ's built-in deduplication system that prevents duplicate records across your scraping campaigns. Available since v2.18.0, SpiderFuzzer provides:
- Per-client isolation - Your data is stored in a separate PostgreSQL schema
- Automatic deduplication - Enable via payload flags on any job type
- Standalone API - Check and manage records directly via dedicated endpoints
- Multi-field matching - Match on email, google_place_id, linkedin_url, phone, or domain
Two Ways to Use SpiderFuzzer
1. Automatic Mode (In Jobs)
Add `fuzziq_enabled: true` to any job payload and each record is automatically checked against your canonical database:
```bash
curl -X POST https://spideriq.ai/api/v1/jobs/spiderMaps/submit \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "search_query": "restaurants in Paris",
      "max_results": 20,
      "fuzziq_enabled": true
    }
  }'
```
Results include a `fuzziq_unique` flag on each record:
```json
{
  "data": {
    "businesses": [
      {
        "name": "Le Petit Bistro",
        "place_id": "ChIJ123...",
        "fuzziq_unique": true
      },
      {
        "name": "Cafe de Flore",
        "place_id": "ChIJ456...",
        "fuzziq_unique": false
      }
    ]
  }
}
```
2. Standalone API
Use SpiderFuzzer endpoints directly for custom workflows:
| Endpoint | Use Case |
|---|---|
| `POST /fuzziq/check` | Check a single record |
| `POST /fuzziq/check-batch` | Check up to 100 records |
| `POST /fuzziq/canonical/import` | Bulk import up to 1000 records |
| `GET /fuzziq/canonical` | List your canonical records |
| `GET /fuzziq/canonical/stats` | Get database statistics |
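The single-record endpoint is easy to wrap in a small helper. This is a sketch rather than official client code: the request body fields (`record`, `record_type`) mirror the batch examples later on this page, the response shape is an assumption to verify against the API reference, and `SPIDERIQ_TOKEN` is a hypothetical environment variable.

```python
import os

import requests

API_BASE = "https://spideriq.ai/api/v1"
HEADERS = {
    "Authorization": f"Bearer {os.environ.get('SPIDERIQ_TOKEN', '')}",
    "Content-Type": "application/json",
}


def check_record(record: dict, record_type: str = "business") -> dict:
    """Check one record against your canonical database."""
    resp = requests.post(
        f"{API_BASE}/fuzziq/check",
        headers=HEADERS,
        json={"record": record, "record_type": record_type},
    )
    resp.raise_for_status()
    return resp.json()


# Example (requires a valid token):
# result = check_record({"email": "jane@example.com"}, record_type="contact")
```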
Record Types
SpiderFuzzer supports four record types, each with different matching priorities:
| Type | Description | Key fields | Best for |
|---|---|---|---|
| Business | Google Maps businesses | `google_place_id`, `company_name`, `phone` | SpiderMaps results |
| Contact | People/contacts | `email`, `full_name`, `linkedin_url` | SpiderSite extracted contacts |
| Email | Email-only records | `email` | SpiderVerify results |
| Profile | LinkedIn profiles | `linkedin_url`, `full_name` | SpiderPeople results |
Match Types
SpiderFuzzer checks fields in priority order (first match wins):
| Priority | Field | Description |
|---|---|---|
| 1 | google_place_id | Exact match (best for businesses) |
| 2 | email | Exact match (normalized, lowercase) |
| 3 | linkedin_url | Exact match (normalized) |
| 4 | phone | Exact match (normalized, digits only) |
| 5 | company_domain | Exact match |
| 6 | exact_hash | SHA256 hash of all normalized fields |
All matches return `confidence: 1.0` (exact matching only; fuzzy matching is not yet available).
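The first-match-wins priority order can be sketched in Python. This is an illustration of the ordering, not SpiderFuzzer's actual implementation; normalization details beyond what the table states (for example, trailing-slash handling for LinkedIn URLs and the exact input format of the SHA-256 hash) are assumptions.

```python
import hashlib
import re

# Priority order mirrors the table above: the first populated field wins.
MATCH_FIELDS = ["google_place_id", "email", "linkedin_url", "phone", "company_domain"]


def normalize(field: str, value: str) -> str:
    value = value.strip()
    if field == "email":
        return value.lower()  # normalized, lowercase (per the table)
    if field == "phone":
        return re.sub(r"\D", "", value)  # digits only (per the table)
    if field == "linkedin_url":
        return value.lower().rstrip("/")  # assumed normalization
    return value


def match_key(record: dict) -> tuple:
    """Return (field, normalized_value) for the highest-priority populated
    field; fall back to a SHA-256 hash over all normalized fields."""
    for field in MATCH_FIELDS:
        if record.get(field):
            return field, normalize(field, record[field])
    blob = "|".join(f"{k}={normalize(k, str(v))}" for k, v in sorted(record.items()))
    return "exact_hash", hashlib.sha256(blob.encode()).hexdigest()
```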
Common Workflows
Deduplicate Multi-Location Campaigns
When scraping the same business type across multiple cities, the same chains appear in each location:
```python
import requests

TOKEN = "your_token"
headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

# Run the campaign across 3 cities
cities = ["Paris", "Lyon", "Marseille"]
all_unique_businesses = []

for city in cities:
    # Submit a SpiderMaps job with SpiderFuzzer enabled
    job = requests.post(
        "https://spideriq.ai/api/v1/jobs/spiderMaps/submit",
        headers=headers,
        json={
            "payload": {
                "search_query": f"restaurants in {city}",
                "max_results": 50,
                "fuzziq_enabled": True,
                "fuzziq_unique_only": True,  # Only return unique records
            }
        },
    ).json()

    # Wait for results (poll_for_results is your own polling helper)
    results = poll_for_results(job["job_id"])

    # All returned businesses are guaranteed unique
    all_unique_businesses.extend(results["data"]["businesses"])
    print(f"{city}: {len(results['data']['businesses'])} unique businesses")

print(f"Total unique across all cities: {len(all_unique_businesses)}")
```
Pre-seed from CRM
Before running campaigns, import your existing customers so they're automatically excluded:
```python
# Import existing customers
customers = [
    {"company_domain": "existingcustomer1.com"},
    {"company_domain": "existingcustomer2.com"},
    # ... more customers
]

requests.post(
    "https://spideriq.ai/api/v1/fuzziq/canonical/import",
    headers=headers,
    json={
        "records": customers,
        "record_type": "business",
        "skip_duplicates": True,
    },
)

# Now when you run SpiderMaps, existing customers are marked as duplicates
```
Check Before Expensive Operations
Use SpiderFuzzer batch check to filter records before SpiderSite or SpiderVerify:
```python
# Got 100 businesses from SpiderMaps
businesses = spidermaps_result["data"]["businesses"]

# Check with SpiderFuzzer before running SpiderSite on each
dedup_result = requests.post(
    "https://spideriq.ai/api/v1/fuzziq/check-batch",
    headers=headers,
    json={
        "records": [
            {"google_place_id": b["place_id"], "company_name": b["name"]}
            for b in businesses
        ],
        "record_type": "business",
        "add_to_canonical": True,  # Add unique ones automatically
    },
).json()

# Only run SpiderSite on unique businesses (saves money!)
unique_businesses = dedup_result["unique"]
print(
    f"Running SpiderSite on {len(unique_businesses)} unique businesses "
    f"(skipped {dedup_result['stats']['duplicate_count']} duplicates)"
)
```
Job Payload Options
Add these to any job payload:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `fuzziq_enabled` | boolean | client default | Enable SpiderFuzzer for this job |
| `fuzziq_unique_only` | boolean | `false` | Only return unique records |
Example:
```json
{
  "payload": {
    "search_query": "coffee shops in Berlin",
    "fuzziq_enabled": true,
    "fuzziq_unique_only": true
  }
}
```
Data Isolation
Each client has a completely isolated PostgreSQL schema:
```text
SpiderFuzzer Database
├── Schema: client_abc123 (your data)
│   └── canonical_records: 15,420 rows
├── Schema: client_def456 (another client)
│   └── canonical_records: 8,750 rows
└── Schema: client_ghi789 (another client)
    └── canonical_records: 42,100 rows
```
Your data is never mixed with other clients.
API Reference
| Operation | Description |
|---|---|
| Check Single | Check one record for duplicates |
| Check Batch | Check up to 100 records |
| List Records | Browse your canonical database |
| Add Record | Manually add a record |
| Bulk Import | Import up to 1000 records |
| Statistics | View database stats |
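For example, a quick way to see what is in your schema is the stats endpoint. The sketch below assumes a hypothetical `SPIDERIQ_TOKEN` environment variable, and the response field names are not documented on this page, so treat the returned dict as opaque until you check the endpoint reference.

```python
import os

import requests

API_BASE = "https://spideriq.ai/api/v1"


def canonical_stats() -> dict:
    """Fetch statistics for your canonical database."""
    resp = requests.get(
        f"{API_BASE}/fuzziq/canonical/stats",
        headers={"Authorization": f"Bearer {os.environ.get('SPIDERIQ_TOKEN', '')}"},
    )
    resp.raise_for_status()
    return resp.json()


# stats = canonical_stats()  # requires a valid token
```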
Best Practices
Seed your database before campaigns
Import existing customers, competitors, or blocked domains before running campaigns. This prevents wasting resources on records you already have.
Use batch check for efficiency
Instead of checking records one by one, use check-batch with up to 100 records per request. For imports, use canonical/import with up to 1000 records.
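A campaign can easily produce more than 100 records, so a small generator keeps each `check-batch` request under the limit. The helper is generic Python; the commented request mirrors the batch example earlier on this page.

```python
def chunked(records, size):
    """Yield successive slices of at most `size` records."""
    for i in range(0, len(records), size):
        yield records[i:i + size]


# Sketch: check 250 records in three requests of at most 100 each.
# for batch in chunked(all_records, 100):
#     requests.post(
#         "https://spideriq.ai/api/v1/fuzziq/check-batch",
#         headers=headers,
#         json={"records": batch, "record_type": "business"},
#     )
```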
Enable fuzziq_unique_only for clean results
When you only want new records, set fuzziq_unique_only: true in your job payload. This filters duplicates server-side so you only process unique data.
Use idempotency keys for retries
When using check-batch, include an idempotency_key to safely retry failed requests without creating duplicate canonical entries.
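One way to get a stable key is to derive it from the batch contents, so retrying the exact same batch automatically reuses the same key. This is one possible scheme, not something the API requires; a random UUID stored alongside the pending request works equally well.

```python
import hashlib
import json


def batch_idempotency_key(records: list) -> str:
    """Derive a stable idempotency key from the batch contents
    (hypothetical scheme: same records, same key)."""
    blob = json.dumps(records, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:32]


# payload = {
#     "records": batch,
#     "record_type": "business",
#     "idempotency_key": batch_idempotency_key(batch),
# }
```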
SpiderFuzzer is automatically enabled for all clients. Contact support if you need to adjust your settings or have questions about your canonical database.