SpiderSite Complete Guide
Overview
SpiderSite is an intelligent website crawler with AI-powered lead generation. It crawls websites, extracts contact information, and optionally applies AI analysis for company insights, team identification, and lead scoring.
Version 2.14.0: Compendiums now stored in SpiderMedia with permanent public URLs.
Version 2.10.0: All AI features now combine into a single efficient API call, including custom prompts for tailored analysis.
How SpiderSite Works
┌─────────────────────────────────────────────────────────────────────────┐
│ SpiderSite Flow │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. CRAWL PHASE │
│ ├── Check for sitemap.xml (fastest method) │
│ ├── Score URLs by relevance (contact, about, team pages first) │
│ ├── Auto-detect SPA (React/Vue/Angular) → use Playwright │
│ └── Crawl up to max_pages using selected strategy │
│ │
│ 2. EXTRACTION PHASE (No AI - Always runs) │
│ ├── Extract emails, phones, addresses │
│ ├── Find social media profiles (14 platforms) │
│ └── Generate markdown compendium of all content │
│ │
│ 3. AI ANALYSIS PHASE (Opt-in - ONE unified call) │
│ └── Combines ALL enabled features: │
│ ├── extract_team → Team members with titles/emails │
│ ├── extract_company_info → Company summary/services │
│ ├── extract_pain_points → Business challenges │
│ ├── Lead scoring (CHAMP) → If product/ICP provided │
│ └── custom_ai_prompt → Your custom analysis │
│ │
│ 4. RESPONSE │
│ └── Structured JSON with all extracted data │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The 5 Request Types
SpiderSite supports five levels of extraction, from basic scraping to full AI analysis:
| Type | Description | AI Used | Cost |
|---|---|---|---|
| 1. Basic Scraping | URL → markdown compendium only | No | Free |
| 2. Contact Extraction | Scrape + contacts/social media | No | Free |
| 3. AI Lead Intelligence | + team, company info, pain points | Yes | AI tokens |
| 4. CHAMP Lead Scoring | + lead scoring with product/ICP | Yes | AI tokens |
| 5. Custom AI Prompts | + your own analysis prompts | Yes | AI tokens |
Example 1: Basic Contact Extraction (No AI)
The simplest request - just provide a URL:
cURL:
curl -X POST "https://spideriq.ai/api/v1/jobs/spiderSite/submit" \
-H "Authorization: Bearer $CLIENT_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"payload": {
"url": "https://example.com",
"max_pages": 5
}
}'
Python:

import requests
response = requests.post(
"https://spideriq.ai/api/v1/jobs/spiderSite/submit",
headers={"Authorization": f"Bearer {CLIENT_TOKEN}"},
json={
"payload": {
"url": "https://example.com",
"max_pages": 5
}
}
)
job = response.json()
print(f"Job ID: {job['job_id']}")
JavaScript:

const response = await fetch(
'https://spideriq.ai/api/v1/jobs/spiderSite/submit',
{
method: 'POST',
headers: {
'Authorization': `Bearer ${CLIENT_TOKEN}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
payload: {
url: 'https://example.com',
max_pages: 5
}
})
}
);
const job = await response.json();
console.log('Job ID:', job.job_id);
What you get:
- Emails, phones, addresses
- Social media links (14 platforms)
- Markdown compendium (fit level)
- No AI tokens used
Example 2: Full Lead Intelligence (AI Enabled)
Extract company info and team members:
{
"payload": {
"url": "https://techstartup.io",
"max_pages": 15,
"extract_team": true,
"extract_company_info": true,
"extract_pain_points": true
}
}
What you get:
- All contact info
- Company vitals (name, summary, industry, services, target audience)
- Team members (names, titles, emails, LinkedIn)
- Pain points analysis
- Markdown compendium
Example 3: CHAMP Lead Scoring
Complete lead scoring with the CHAMP framework:
{
"payload": {
"url": "https://enterprise-target.com",
"max_pages": 20,
"extract_team": true,
"extract_company_info": true,
"extract_pain_points": true,
"product_description": "AI-powered sales automation platform that helps B2B teams close deals 3x faster",
"icp_description": "Mid-market B2B SaaS companies with 50-500 employees, $10M-$100M ARR"
}
}
What you get:
- Everything from Example 2, plus:
- CHAMP Analysis:
- Challenges: Specific pain points matched to your solution
- Authority: Decision makers and buying process
- Money: Budget indicators and funding status
- Prioritization: Urgency signals and priority level
- ICP fit score (0-1)
- Personalization hooks for outreach
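Once results arrive, the CHAMP fields can drive simple triage logic. A minimal sketch, assuming the `lead_scoring` shape shown in the Response Structure section (`icp_fit_grade`, `engagement_score`, `lead_priority`); the thresholds are illustrative, not API-defined:

```python
def triage_lead(data: dict) -> str:
    """Bucket a completed spiderSite result using its lead_scoring fields."""
    scoring = data.get("lead_scoring") or {}
    grade = scoring.get("icp_fit_grade", "")
    score = scoring.get("engagement_score", 0)
    if grade == "A" and score >= 80:
        return "immediate-outreach"
    if grade in ("A", "B"):
        return "nurture"
    return "archive"

sample = {"lead_scoring": {"icp_fit_grade": "A", "engagement_score": 85,
                           "lead_priority": "Hot"}}
print(triage_lead(sample))  # immediate-outreach
```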
Example 4: Custom AI Analysis (v2.10.0)
Extract specific information using your own prompts:
{
"payload": {
"url": "https://saas-company.com",
"max_pages": 10,
"custom_ai_prompt": {
"enabled": true,
"system_prompt": "You are a cybersecurity analyst specializing in SaaS platforms.",
"user_prompt": "Extract all security certifications, compliance frameworks, and data privacy practices mentioned on this website.",
"json_schema": {
"security_certifications": ["SOC 2", "ISO 27001"],
"compliance_frameworks": ["GDPR", "HIPAA"],
"data_privacy_summary": "string"
},
"model": "google/gemini-2.0-flash-exp:free",
"temperature": 0.1,
"max_tokens": 4000
}
}
}
Response includes:
{
"data": {
"custom_analysis": {
"security_certifications": ["SOC 2 Type II", "ISO 27001"],
"compliance_frameworks": ["GDPR", "CCPA", "HIPAA"],
"data_privacy_summary": "Company maintains strict data encryption..."
}
}
}
Example 5: Combined AI + Custom Prompt (ONE Call!)
All AI features in a single API call for maximum efficiency:
{
"payload": {
"url": "https://target-company.com",
"max_pages": 15,
"extract_team": true,
"extract_company_info": true,
"extract_pain_points": true,
"product_description": "HR automation platform",
"icp_description": "Companies with 100-1000 employees",
"custom_ai_prompt": {
"enabled": true,
"system_prompt": "You are a competitive intelligence analyst.",
"user_prompt": "Extract pricing information, key differentiators, and main competitors mentioned.",
"output_field_name": "competitive_intel",
"model": "google/gemini-2.0-flash-exp:free",
"temperature": 0.2,
"max_tokens": 6000
}
}
}
All extracted in ONE API call:
- Team members
- Company info
- Pain points
- Lead scoring (CHAMP)
- Custom competitive intel
Example 6: Minimal Compendium for LLM Context
Optimize for RAG/LLM applications with minimal token usage:
{
"payload": {
"url": "https://content-heavy-site.com",
"max_pages": 30,
"compendium": {
"enabled": true,
"cleanup_level": "minimal",
"max_chars": 50000,
"remove_duplicates": true
}
}
}
Cleanup levels:
| Level | Size | Best For |
|---|---|---|
| `raw` | 100% | Full fidelity, archival |
| `fit` | ~60% | General purpose (default) |
| `citations` | ~35% | Academic format with sources |
| `minimal` | ~15% | LLM consumption, token savings |
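If you are budgeting context-window tokens, the size percentages above give a rough way to pick a level. A back-of-the-envelope sketch; the ~4-chars-per-token ratio and the percentages are approximations, not API guarantees:

```python
# Approximate sizes relative to "raw", from the table above.
LEVEL_RATIO = {"raw": 1.00, "fit": 0.60, "citations": 0.35, "minimal": 0.15}
CHARS_PER_TOKEN = 4  # common rule of thumb for English text

def estimate_tokens(raw_chars: int, level: str) -> int:
    """Estimate token count of a compendium at a given cleanup level."""
    return int(raw_chars * LEVEL_RATIO[level] / CHARS_PER_TOKEN)

# A 200k-char site at "minimal" fits comfortably in a small context window:
print(estimate_tokens(200_000, "minimal"))  # 7500
print(estimate_tokens(200_000, "fit"))      # 30000
```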
Example 7: SPA-Heavy Site
For React/Vue/Angular sites that need JavaScript rendering:
{
"payload": {
"url": "https://react-dashboard.app",
"max_pages": 10,
"enable_spa": true,
"spa_timeout": 60,
"extract_company_info": true
}
}
SPA detection is automatic by default. Increase `spa_timeout` for slow-loading sites.
Response Structure
{
"success": true,
"job_id": "uuid",
"type": "spiderSite",
"status": "completed",
"processing_time_seconds": 25.4,
"data": {
"url": "https://example.com",
"pages_crawled": 10,
"crawl_status": "success",
"emails": ["contact@example.com", "sales@example.com"],
"phones": ["+1-555-123-4567"],
"addresses": ["123 Main St, SF, CA"],
"linkedin": "https://linkedin.com/company/example",
"twitter": "https://twitter.com/example",
"facebook": null,
"instagram": null,
"youtube": null,
"tiktok": null,
"github": "https://github.com/example",
"pinterest": null,
"snapchat": null,
"reddit": null,
"medium": null,
"discord": null,
"whatsapp": null,
"telegram": null,
"company_vitals": {
"one_sentence_summary": "...",
"key_services": ["Service A", "Service B"],
"target_audience": "...",
"industry": "B2B SaaS"
},
"team_members": [
{
"name": "John Doe",
"title": "CEO",
"email": "john@example.com",
"linkedin": "https://linkedin.com/in/johndoe"
}
],
"pain_points": {
"inferred_challenges": ["Challenge 1", "Challenge 2"],
"recent_mentions": ["News item 1"]
},
"lead_scoring": {
"icp_fit_grade": "A",
"engagement_score": 85,
"lead_priority": "Hot",
"champ_breakdown": {
"challenges": "...",
"authority": "...",
"money": "...",
"prioritization": "..."
}
},
"custom_analysis": {
"your_custom_fields": "..."
},
"markdown_compendium": "# Company Name\n\n...",
"compendium": {
"available": true,
"storage_location": "inline",
"size_chars": 45000,
"cleanup_level": "fit"
},
"metadata": {
"crawl_strategy": "sitemap",
"spa_enabled": true,
"browser_rendering_available": true
}
}
}
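Most of the 14 social media fields come back `null`; a small helper (a sketch, not part of any SDK) can collapse them into just the profiles that were found:

```python
# All social fields present in a spiderSite result, per the structure above.
SOCIAL_FIELDS = [
    "linkedin", "twitter", "facebook", "instagram", "youtube", "tiktok",
    "github", "pinterest", "snapchat", "reddit", "medium", "discord",
    "whatsapp", "telegram",
]

def found_socials(data: dict) -> dict:
    """Return only the social profiles that are non-null in a result."""
    return {f: data[f] for f in SOCIAL_FIELDS if data.get(f)}

sample = {"linkedin": "https://linkedin.com/company/example",
          "github": "https://github.com/example", "twitter": None}
print(found_socials(sample))
# {'linkedin': 'https://linkedin.com/company/example', 'github': 'https://github.com/example'}
```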
Compendium Storage (SpiderMedia v2.14.0)
Compendiums are uploaded to your dedicated SpiderMedia bucket with permanent public URLs:
{
"data": {
"markdown_compendium": "# Company Name\n\n...",
"compendium": {
"available": true,
"storage_location": "spidermedia",
"download_url": "https://media.spideriq.ai/client-xxx/compendiums/job-uuid.md",
"filename": "compendiums/job-uuid.md",
"size_bytes": 45000,
"content_hash": "abc123..."
}
}
}
Permanent URLs: SpiderMedia URLs never expire. No more 24-hour download windows!
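A result can therefore carry its compendium either inline or as a SpiderMedia URL. A minimal sketch for handling both cases (the `content_hash` check is omitted because the hash algorithm isn't specified here):

```python
import requests

def get_compendium(data: dict) -> str:
    """Return the markdown compendium, downloading from SpiderMedia if needed."""
    comp = data.get("compendium") or {}
    if not comp.get("available"):
        return ""
    if comp.get("storage_location") == "spidermedia":
        # Permanent public URL: safe to fetch lazily, any time after the job.
        resp = requests.get(comp["download_url"], timeout=30)
        resp.raise_for_status()
        return resp.text
    # "inline": the markdown is already in the response body.
    return data.get("markdown_compendium", "")
```

Because SpiderMedia URLs never expire, you can also store just `download_url` and defer the fetch until the compendium is actually needed.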
Complete Workflow Example
Here's a complete workflow from submission to result retrieval:
import requests
import time

# Configuration
API_BASE = "https://spideriq.ai/api/v1"
CLIENT_TOKEN = "<your_client_id>:<your_api_key>:<your_api_secret>"
headers = {"Authorization": f"Bearer {CLIENT_TOKEN}"}

# Step 1: Submit job
submit_data = {
    "payload": {
        "url": "https://target-company.com",
        "max_pages": 10,
        "extract_company_info": True,
        "extract_team": True
    }
}

response = requests.post(
    f"{API_BASE}/jobs/spiderSite/submit",
    headers=headers,
    json=submit_data
)
job_id = response.json()['job_id']
print(f"✓ Job submitted: {job_id}")

# Step 2: Poll for completion
max_wait = 120  # 2 minutes
start_time = time.time()

while time.time() - start_time < max_wait:
    response = requests.get(
        f"{API_BASE}/jobs/{job_id}/results",
        headers=headers
    )
    result = response.json()

    if result['status'] == 'completed':
        print("✓ Job completed!")
        data = result['data']

        # Access extracted data
        print(f"\nEmails: {data['emails']}")
        print(f"Phones: {data['phones']}")
        print(f"LinkedIn: {data['linkedin']}")

        if data.get('company_vitals'):
            print(f"\nCompany: {data['company_vitals']['one_sentence_summary']}")

        if data.get('team_members'):
            print(f"\nTeam Members: {len(data['team_members'])}")
            for member in data['team_members']:
                print(f"  - {member['name']}: {member.get('title', 'N/A')}")
        break
    elif result['status'] == 'failed':
        print(f"✗ Job failed: {result.get('error_message')}")
        break
    else:
        print(f"⏳ Status: {result['status']}...")
        time.sleep(3)
else:
    print("✗ Timeout waiting for job to complete")
Best Practices
When to use AI features
Use AI features when:
- Qualifying high-value leads
- Building targeted outreach campaigns
- Identifying decision makers
- Scoring leads by ICP fit
Skip AI features for:
- Bulk contact extraction
- Budget-sensitive scraping
- When you only need contact info
Optimizing crawl strategy
- `bestfirst` (default): best for most use cases; intelligent prioritization
- Sitemap-first (automatic): used automatically when sitemap.xml is discovered
- `bfs`: when you need broad coverage across sections
- `dfs`: when you need deep coverage of specific sections
Choosing cleanup level
| Level | Use Case |
|---|---|
| `raw` | Academic research, legal compliance |
| `fit` | General purpose (default) |
| `citations` | Research documents with sources |
| `minimal` | LLM/RAG applications |
Custom AI prompt tips
- Be specific: clearly define what data you want extracted
- Use `json_schema`: helps the AI return structured data
- Set `output_field_name`: organizes multiple custom analyses
- Adjust `temperature`: lower (0.1) for factual extraction, higher (0.5+) for creative analysis
Error Handling
URL Not Accessible
Error: "Failed to connect to target URL"
Causes:
- Invalid URL
- Site blocking bots
- Site requires authentication
Solutions:
- Verify URL is correct and publicly accessible
- Check if site blocks automated access
Timeout
Error: "Page load timeout exceeded"
Causes:
- Slow-loading site
- Heavy JavaScript rendering
Solutions:
- Increase `timeout` parameter (max 120s)
- Increase `spa_timeout` for SPA sites
- Reduce `max_pages`
Rate Limit Exceeded
Error: "Rate limit exceeded"
Solutions:
- Implement delays between requests
- Use exponential backoff
- Contact support for higher limits
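Exponential backoff can be sketched as below; the retry trigger (HTTP 429) and the delay parameters are illustrative assumptions, not documented API behavior:

```python
import time
import requests

def backoff_delays(max_retries: int = 5, base: float = 1.0) -> list:
    """Delay before each retry attempt: base, 2*base, 4*base, ..."""
    return [base * (2 ** n) for n in range(max_retries)]

def submit_with_backoff(url: str, headers: dict, body: dict) -> dict:
    """POST a job, backing off exponentially while rate limited (HTTP 429)."""
    for delay in backoff_delays():
        resp = requests.post(url, headers=headers, json=body, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        time.sleep(delay)
    raise RuntimeError("Still rate limited after all retries")
```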
Limitations
Authentication: SpiderSite cannot scrape pages requiring login
CAPTCHAs: Sites with CAPTCHA protection cannot be scraped
robots.txt: SpiderSite respects robots.txt directives