SpiderSite Complete Guide

Overview

SpiderSite is an intelligent website crawler with AI-powered lead generation. It crawls websites, extracts contact information, and optionally applies AI analysis for company insights, team identification, and lead scoring.

info

Version 2.14.0: Compendiums now stored in SpiderMedia with permanent public URLs.

info

Version 2.10.0: All AI features now combine into a single efficient API call, including custom prompts for tailored analysis.

How SpiderSite Works

┌─────────────────────────────────────────────────────────────────────────┐
│ SpiderSite Flow │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. CRAWL PHASE │
│ ├── Check for sitemap.xml (fastest method) │
│ ├── Score URLs by relevance (contact, about, team pages first) │
│ ├── Auto-detect SPA (React/Vue/Angular) → use Playwright │
│ └── Crawl up to max_pages using selected strategy │
│ │
│ 2. EXTRACTION PHASE (No AI - Always runs) │
│ ├── Extract emails, phones, addresses │
│ ├── Find social media profiles (14 platforms) │
│ └── Generate markdown compendium of all content │
│ │
│ 3. AI ANALYSIS PHASE (Opt-in - ONE unified call) │
│ └── Combines ALL enabled features: │
│ ├── extract_team → Team members with titles/emails │
│ ├── extract_company_info → Company summary/services │
│ ├── extract_pain_points → Business challenges │
│ ├── Lead scoring (CHAMP) → If product/ICP provided │
│ └── custom_ai_prompt → Your custom analysis │
│ │
│ 4. RESPONSE │
│ └── Structured JSON with all extracted data │
│ │
└─────────────────────────────────────────────────────────────────────────┘

The 5 Request Types

SpiderSite supports 5 different levels of extraction, from basic scraping to full AI analysis:

Type                      Description                           AI Used   Cost
1. Basic Scraping         URL → markdown compendium only        No        Free
2. Contact Extraction     Scrape + contacts/social media        No        Free
3. AI Lead Intelligence   + team, company info, pain points     Yes       AI tokens
4. CHAMP Lead Scoring     + lead scoring with product/ICP       Yes       AI tokens
5. Custom AI Prompts      + your own analysis prompts           Yes       AI tokens

Example 1: Basic Contact Extraction (No AI)

The simplest request - just provide a URL:

curl -X POST "https://spideriq.ai/api/v1/jobs/spiderSite/submit" \
  -H "Authorization: Bearer $CLIENT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "url": "https://example.com",
      "max_pages": 5
    }
  }'

What you get:

  • Emails, phones, addresses
  • Social media links (14 platforms)
  • Markdown compendium (fit level)
  • No AI tokens used

Example 2: Full Lead Intelligence (AI Enabled)

Extract company info and team members:

{
  "payload": {
    "url": "https://techstartup.io",
    "max_pages": 15,
    "extract_team": true,
    "extract_company_info": true,
    "extract_pain_points": true
  }
}

What you get:

  • All contact info
  • Company vitals (name, summary, industry, services, target audience)
  • Team members (names, titles, emails, LinkedIn)
  • Pain points analysis
  • Markdown compendium

Example 3: CHAMP Lead Scoring

Complete lead scoring with the CHAMP framework:

{
  "payload": {
    "url": "https://enterprise-target.com",
    "max_pages": 20,
    "extract_team": true,
    "extract_company_info": true,
    "extract_pain_points": true,
    "product_description": "AI-powered sales automation platform that helps B2B teams close deals 3x faster",
    "icp_description": "Mid-market B2B SaaS companies with 50-500 employees, $10M-$100M ARR"
  }
}

What you get:

  • Everything from Example 2, plus:
  • CHAMP Analysis:
    • Challenges: Specific pain points matched to your solution
    • Authority: Decision makers and buying process
    • Money: Budget indicators and funding status
    • Prioritization: Urgency signals and priority level
  • ICP fit score (0-1)
  • Personalization hooks for outreach
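
Once a scored job completes, the CHAMP block can be read straight from the result payload. A minimal sketch — field names follow the Response Structure section of this guide; the sample dict below is illustrative, not real API output:

```python
def summarize_champ(data):
    """Return a one-line summary of the lead_scoring block, if present."""
    scoring = data.get("lead_scoring")
    if not scoring:
        # Scoring only runs when product_description/icp_description are provided
        return "no lead scoring (product/ICP not provided?)"
    return (f"ICP grade {scoring.get('icp_fit_grade', '?')}, "
            f"engagement {scoring.get('engagement_score', '?')}, "
            f"priority {scoring.get('lead_priority', '?')}")

# Illustrative sample shaped like the documented response
sample = {
    "lead_scoring": {
        "icp_fit_grade": "A",
        "engagement_score": 85,
        "lead_priority": "Hot",
    }
}
print(summarize_champ(sample))  # ICP grade A, engagement 85, priority Hot
```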

Example 4: Custom AI Analysis (v2.10.0)

Extract specific information using your own prompts:

{
  "payload": {
    "url": "https://saas-company.com",
    "max_pages": 10,
    "custom_ai_prompt": {
      "enabled": true,
      "system_prompt": "You are a cybersecurity analyst specializing in SaaS platforms.",
      "user_prompt": "Extract all security certifications, compliance frameworks, and data privacy practices mentioned on this website.",
      "json_schema": {
        "security_certifications": ["SOC 2", "ISO 27001"],
        "compliance_frameworks": ["GDPR", "HIPAA"],
        "data_privacy_summary": "string"
      },
      "model": "google/gemini-2.0-flash-exp:free",
      "temperature": 0.1,
      "max_tokens": 4000
    }
  }
}

Response includes:

{
  "data": {
    "custom_analysis": {
      "security_certifications": ["SOC 2 Type II", "ISO 27001"],
      "compliance_frameworks": ["GDPR", "CCPA", "HIPAA"],
      "data_privacy_summary": "Company maintains strict data encryption..."
    }
  }
}

Example 5: Combined AI + Custom Prompt (ONE Call!)

All AI features in a single API call for maximum efficiency:

{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 15,
    "extract_team": true,
    "extract_company_info": true,
    "extract_pain_points": true,
    "product_description": "HR automation platform",
    "icp_description": "Companies with 100-1000 employees",
    "custom_ai_prompt": {
      "enabled": true,
      "system_prompt": "You are a competitive intelligence analyst.",
      "user_prompt": "Extract pricing information, key differentiators, and main competitors mentioned.",
      "output_field_name": "competitive_intel",
      "model": "google/gemini-2.0-flash-exp:free",
      "temperature": 0.2,
      "max_tokens": 6000
    }
  }
}

All extracted in ONE API call:

  • Team members
  • Company info
  • Pain points
  • Lead scoring (CHAMP)
  • Custom competitive intel

Example 6: Minimal Compendium for LLM Context

Optimize for RAG/LLM applications with minimal token usage:

{
  "payload": {
    "url": "https://content-heavy-site.com",
    "max_pages": 30,
    "compendium": {
      "enabled": true,
      "cleanup_level": "minimal",
      "max_chars": 50000,
      "remove_duplicates": true
    }
  }
}

Cleanup levels:

Level       Size    Best For
raw         100%    Full fidelity, archival
fit         ~60%    General purpose (default)
citations   ~35%    Academic format with sources
minimal     ~15%    LLM consumption, token savings
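
The ratios above can serve as a rough size planner before submitting a job. A sketch only — the percentages are approximate and real output varies by site:

```python
# Approximate output size per cleanup level, as a fraction of raw size.
# These ratios come from the table above and are ballpark figures only.
CLEANUP_RATIOS = {"raw": 1.00, "fit": 0.60, "citations": 0.35, "minimal": 0.15}

def estimated_chars(raw_chars, level):
    """Estimate compendium size in characters for a given cleanup level."""
    return int(raw_chars * CLEANUP_RATIOS[level])

# For a site that would produce ~200k raw characters:
for level in CLEANUP_RATIOS:
    print(f"{level:10s} ~{estimated_chars(200_000, level):>7,} chars")
```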

Example 7: SPA-Heavy Site

For React/Vue/Angular sites that need JavaScript rendering:

{
  "payload": {
    "url": "https://react-dashboard.app",
    "max_pages": 10,
    "enable_spa": true,
    "spa_timeout": 60,
    "extract_company_info": true
  }
}

tip

SPA detection is automatic by default. Increase spa_timeout for slow-loading sites.


Response Structure

{
  "success": true,
  "job_id": "uuid",
  "type": "spiderSite",
  "status": "completed",
  "processing_time_seconds": 25.4,
  "data": {
    "url": "https://example.com",
    "pages_crawled": 10,
    "crawl_status": "success",

    "emails": ["contact@example.com", "sales@example.com"],
    "phones": ["+1-555-123-4567"],
    "addresses": ["123 Main St, SF, CA"],

    "linkedin": "https://linkedin.com/company/example",
    "twitter": "https://twitter.com/example",
    "facebook": null,
    "instagram": null,
    "youtube": null,
    "tiktok": null,
    "github": "https://github.com/example",
    "pinterest": null,
    "snapchat": null,
    "reddit": null,
    "medium": null,
    "discord": null,
    "whatsapp": null,
    "telegram": null,

    "company_vitals": {
      "one_sentence_summary": "...",
      "key_services": ["Service A", "Service B"],
      "target_audience": "...",
      "industry": "B2B SaaS"
    },

    "team_members": [
      {
        "name": "John Doe",
        "title": "CEO",
        "email": "john@example.com",
        "linkedin": "https://linkedin.com/in/johndoe"
      }
    ],

    "pain_points": {
      "inferred_challenges": ["Challenge 1", "Challenge 2"],
      "recent_mentions": ["News item 1"]
    },

    "lead_scoring": {
      "icp_fit_grade": "A",
      "engagement_score": 85,
      "lead_priority": "Hot",
      "champ_breakdown": {
        "challenges": "...",
        "authority": "...",
        "money": "...",
        "prioritization": "..."
      }
    },

    "custom_analysis": {
      "your_custom_fields": "..."
    },

    "markdown_compendium": "# Company Name\n\n...",

    "compendium": {
      "available": true,
      "storage_location": "inline",
      "size_chars": 45000,
      "cleanup_level": "fit"
    },

    "metadata": {
      "crawl_strategy": "sitemap",
      "spa_enabled": true,
      "browser_rendering_available": true
    }
  }
}
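
The 14 social profile fields above are flat keys on `data`, with `null` for platforms that were not found. A small helper can collect just the populated ones:

```python
# The 14 social platform keys returned on `data` (see Response Structure).
SOCIAL_FIELDS = [
    "linkedin", "twitter", "facebook", "instagram", "youtube", "tiktok",
    "github", "pinterest", "snapchat", "reddit", "medium", "discord",
    "whatsapp", "telegram",
]

def social_links(data):
    """Return only the social fields that are present and non-null."""
    return {k: data[k] for k in SOCIAL_FIELDS if data.get(k)}

# Illustrative sample: null/missing platforms are filtered out
sample = {"linkedin": "https://linkedin.com/company/example", "facebook": None}
print(social_links(sample))
```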

Compendium Storage (SpiderMedia v2.14.0)

Compendiums are uploaded to your dedicated SpiderMedia bucket with permanent public URLs:

{
  "data": {
    "markdown_compendium": "# Company Name\n\n...",
    "compendium": {
      "available": true,
      "storage_location": "spidermedia",
      "download_url": "https://media.spideriq.ai/client-xxx/compendiums/job-uuid.md",
      "filename": "compendiums/job-uuid.md",
      "size_bytes": 45000,
      "content_hash": "abc123..."
    }
  }
}

tip

Permanent URLs: SpiderMedia URLs never expire. No more 24-hour download windows!
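
Because the URLs are permanent and public, archiving a compendium locally is a plain HTTP GET. A sketch using requests — the `compendium` dict is the response block shown above; the output directory is an arbitrary choice:

```python
import os
import requests

def local_path(compendium, out_dir="compendiums"):
    """Map the remote filename (e.g. 'compendiums/job-uuid.md') to a local path."""
    return os.path.join(out_dir, os.path.basename(compendium["filename"]))

def download_compendium(compendium, out_dir="compendiums"):
    """Fetch the compendium from its SpiderMedia URL and save it locally."""
    path = local_path(compendium, out_dir)
    os.makedirs(out_dir, exist_ok=True)
    resp = requests.get(compendium["download_url"], timeout=30)
    resp.raise_for_status()
    with open(path, "w", encoding="utf-8") as f:
        f.write(resp.text)
    return path
```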


Complete Workflow Example

Here's a complete workflow from submission to result retrieval:

import requests
import time

# Configuration
API_BASE = "https://spideriq.ai/api/v1"
CLIENT_TOKEN = "<your_client_id>:<your_api_key>:<your_api_secret>"
headers = {"Authorization": f"Bearer {CLIENT_TOKEN}"}

# Step 1: Submit job
submit_data = {
    "payload": {
        "url": "https://target-company.com",
        "max_pages": 10,
        "extract_company_info": True,
        "extract_team": True
    }
}

response = requests.post(
    f"{API_BASE}/jobs/spiderSite/submit",
    headers=headers,
    json=submit_data
)
job_id = response.json()['job_id']
print(f"✓ Job submitted: {job_id}")

# Step 2: Poll for completion
max_wait = 120  # 2 minutes
start_time = time.time()

while time.time() - start_time < max_wait:
    response = requests.get(
        f"{API_BASE}/jobs/{job_id}/results",
        headers=headers
    )
    result = response.json()

    if result['status'] == 'completed':
        print("✓ Job completed!")
        data = result['data']

        # Access extracted data
        print(f"\nEmails: {data['emails']}")
        print(f"Phones: {data['phones']}")
        print(f"LinkedIn: {data['linkedin']}")

        if data.get('company_vitals'):
            print(f"\nCompany: {data['company_vitals']['one_sentence_summary']}")

        if data.get('team_members'):
            print(f"\nTeam Members: {len(data['team_members'])}")
            for member in data['team_members']:
                print(f"  - {member['name']}: {member.get('title', 'N/A')}")

        break

    elif result['status'] == 'failed':
        print(f"✗ Job failed: {result.get('error_message')}")
        break

    else:
        print(f"⏳ Status: {result['status']}...")
        time.sleep(3)

else:
    print("✗ Timeout waiting for job to complete")

Best Practices

When to use AI features

Use AI features when:

  • Qualifying high-value leads
  • Building targeted outreach campaigns
  • Identifying decision makers
  • Scoring leads by ICP fit

Skip AI features for:

  • Bulk contact extraction
  • Budget-sensitive scraping
  • When you only need contact info

Optimizing crawl strategy

bestfirst (default): Best for most use cases - intelligent prioritization

Sitemap-first (automatic): Used automatically when sitemap.xml discovered

bfs: When you need broad coverage across sections

dfs: When you need deep coverage of specific sections

Choosing cleanup level

Level       Use Case
raw         Academic research, legal compliance
fit         General purpose (default)
citations   Research documents with sources
minimal     LLM/RAG applications

Custom AI prompt tips

Be specific: Clearly define what data you want extracted

Use json_schema: Helps the AI return structured data

Set output_field_name: Organize multiple custom analyses

Adjust temperature: Lower (0.1) for factual extraction, higher (0.5+) for creative analysis
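
The tips above can be folded into a small helper that assembles the `custom_ai_prompt` block. A sketch only — the field names match Example 4, but the default temperature and max_tokens values here are illustrative choices, not API requirements:

```python
def build_custom_prompt(system_prompt, user_prompt,
                        json_schema=None, output_field_name=None,
                        factual=True):
    """Assemble a custom_ai_prompt payload following the tips above."""
    prompt = {
        "enabled": True,
        "system_prompt": system_prompt,
        "user_prompt": user_prompt,
        # Lower temperature for factual extraction, higher for creative analysis
        "temperature": 0.1 if factual else 0.5,
        "max_tokens": 4000,
    }
    if json_schema:
        prompt["json_schema"] = json_schema      # nudges the AI toward structured output
    if output_field_name:
        prompt["output_field_name"] = output_field_name  # keeps multiple analyses organized
    return prompt
```

Example: `build_custom_prompt("You are a pricing analyst.", "Extract all pricing tiers.", output_field_name="pricing")` yields a block ready to drop into the payload.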


Error Handling

URL Not Accessible

Error: "Failed to connect to target URL"

Causes:

  • Invalid URL
  • Site blocking bots
  • Site requires authentication

Solutions:

  • Verify URL is correct and publicly accessible
  • Check if site blocks automated access

Timeout

Error: "Page load timeout exceeded"

Causes:

  • Slow-loading site
  • Heavy JavaScript rendering

Solutions:

  • Increase timeout parameter (max 120s)
  • Increase spa_timeout for SPA sites
  • Reduce max_pages

Rate Limit Exceeded

Error: "Rate limit exceeded"

Solutions:

  • Implement delays between requests
  • Use exponential backoff
  • Contact support for higher limits
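
The exponential-backoff suggestion can be sketched as follows. Note the HTTP 429 status check is an assumption about how the rate limiter responds, not documented behavior:

```python
import time
import requests

def backoff_delays(retries, base=1.0, cap=60.0):
    """Exponential delay schedule: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return [min(base * (2 ** i), cap) for i in range(retries)]

def submit_with_backoff(url, headers, payload, retries=5):
    """Retry a job submission, backing off when rate-limited (assumed HTTP 429)."""
    for delay in backoff_delays(retries):
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code != 429:
            return resp
        time.sleep(delay)
    raise RuntimeError("Rate limit persisted after retries")
```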

Limitations

warning

Authentication: SpiderSite cannot scrape pages requiring login

warning

CAPTCHAs: Sites with CAPTCHA protection cannot be scraped

info

robots.txt: SpiderSite respects robots.txt directives


Next Steps