SpiderSite Complete Guide
Overview
SpiderSite is an intelligent website crawler with AI-powered lead generation. It crawls websites, extracts contact information, and optionally applies AI analysis for company insights, team identification, and lead scoring.
Version 2.14.0: Compendiums now stored in SpiderMedia with permanent public URLs.
Version 2.10.0: All AI features now combine into a single efficient API call, including custom prompts for tailored analysis.
How SpiderSite Works
┌─────────────────────────────────────────────────────────────────────────┐
│ SpiderSite Flow │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. CRAWL PHASE │
│ ├── Check for sitemap.xml (fastest method) │
│ ├── Score URLs by relevance (contact, about, team pages first) │
│ ├── Auto-detect SPA (React/Vue/Angular) → use Playwright │
│ └── Crawl up to max_pages using selected strategy │
│ │
│ 2. EXTRACTION PHASE (No AI - Always runs) │
│ ├── Extract emails, phones, addresses │
│ ├── Find social media profiles (14 platforms) │
│ └── Generate markdown compendium of all content │
│ │
│ 3. AI ANALYSIS PHASE (Opt-in - ONE unified call) │
│ └── Combines ALL enabled features: │
│ ├── extract_team → Team members with titles/emails │
│ ├── extract_company_info → Company summary/services │
│ ├── extract_pain_points → Business challenges │
│ ├── Lead scoring (CHAMP) → If product/ICP provided │
│ └── custom_ai_prompt → Your custom analysis │
│ │
│ 4. RESPONSE │
│ └── Structured JSON with all extracted data │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The 5 Request Types
SpiderSite supports five levels of extraction, from basic scraping to full AI analysis:
| Type | Description | AI Used | Cost |
|---|---|---|---|
| 1. Basic Scraping | URL → markdown compendium only | No | Free |
| 2. Contact Extraction | Scrape + contacts/social media | No | Free |
| 3. AI Lead Intelligence | + team, company info, pain points | Yes | AI tokens |
| 4. CHAMP Lead Scoring | + lead scoring with product/ICP | Yes | AI tokens |
| 5. Custom AI Prompts | + your own analysis prompts | Yes | AI tokens |
Example 1: Basic Contact Extraction (No AI)
The simplest request - just provide a URL:
cURL:
curl -X POST "https://spideriq.ai/api/v1/jobs/spiderSite/submit" \
-H "Authorization: Bearer $CLIENT_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"payload": {
"url": "https://example.com",
"max_pages": 5
}
}'
Python:

import requests
response = requests.post(
"https://spideriq.ai/api/v1/jobs/spiderSite/submit",
headers={"Authorization": f"Bearer {CLIENT_TOKEN}"},
json={
"payload": {
"url": "https://example.com",
"max_pages": 5
}
}
)
job = response.json()
print(f"Job ID: {job['job_id']}")
JavaScript:

const response = await fetch(
'https://spideriq.ai/api/v1/jobs/spiderSite/submit',
{
method: 'POST',
headers: {
'Authorization': `Bearer ${CLIENT_TOKEN}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
payload: {
url: 'https://example.com',
max_pages: 5
}
})
}
);
const job = await response.json();
console.log('Job ID:', job.job_id);
What you get:
- Emails, phones, addresses
- Social media links (14 platforms)
- Markdown compendium (fit level)
- No AI tokens used
Example 2: Full Lead Intelligence (AI Enabled)
Extract company info and team members:
{
"payload": {
"url": "https://techstartup.io",
"max_pages": 15,
"extract_team": true,
"extract_company_info": true,
"extract_pain_points": true
}
}
What you get:
- All contact info
- Company vitals (name, summary, industry, services, target audience)
- Team members (names, titles, emails, LinkedIn)
- Pain points analysis
- Markdown compendium
Example 3: CHAMP Lead Scoring
Complete lead scoring with the CHAMP framework:
{
"payload": {
"url": "https://enterprise-target.com",
"max_pages": 20,
"extract_team": true,
"extract_company_info": true,
"extract_pain_points": true,
"product_description": "AI-powered sales automation platform that helps B2B teams close deals 3x faster",
"icp_description": "Mid-market B2B SaaS companies with 50-500 employees, $10M-$100M ARR"
}
}
What you get:
- Everything from Example 2, plus:
- CHAMP Analysis:
- Challenges: Specific pain points matched to your solution
- Authority: Decision makers and buying process
- Money: Budget indicators and funding status
- Prioritization: Urgency signals and priority level
- ICP fit score (0-1)
- Personalization hooks for outreach
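Once results arrive, the CHAMP fields can drive simple triage logic. A minimal sketch, assuming the `lead_scoring` shape shown in the Response Structure section (`icp_fit_grade`, `engagement_score`, `lead_priority`); the thresholds are illustrative, not API-defined:

```python
def triage_lead(data: dict) -> str:
    """Bucket a completed spiderSite result using its lead_scoring fields."""
    scoring = data.get("lead_scoring") or {}
    grade = scoring.get("icp_fit_grade", "")
    score = scoring.get("engagement_score", 0)
    if grade == "A" and score >= 80:
        return "immediate-outreach"
    if grade in ("A", "B"):
        return "nurture"
    return "archive"

sample = {"lead_scoring": {"icp_fit_grade": "A", "engagement_score": 85,
                           "lead_priority": "Hot"}}
print(triage_lead(sample))  # immediate-outreach
```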
Example 4: Custom AI Analysis (v2.10.0)
Extract specific information using your own prompts:
{
"payload": {
"url": "https://saas-company.com",
"max_pages": 10,
"custom_ai_prompt": {
"enabled": true,
"system_prompt": "You are a cybersecurity analyst specializing in SaaS platforms.",
"user_prompt": "Extract all security certifications, compliance frameworks, and data privacy practices mentioned on this website.",
"json_schema": {
"security_certifications": ["SOC 2", "ISO 27001"],
"compliance_frameworks": ["GDPR", "HIPAA"],
"data_privacy_summary": "string"
},
"model": "google/gemini-2.0-flash-exp:free",
"temperature": 0.1,
"max_tokens": 4000
}
}
}
Response includes:
{
"data": {
"custom_analysis": {
"security_certifications": ["SOC 2 Type II", "ISO 27001"],
"compliance_frameworks": ["GDPR", "CCPA", "HIPAA"],
"data_privacy_summary": "Company maintains strict data encryption..."
}
}
}
Example 5: Combined AI + Custom Prompt (ONE Call!)
All AI features in a single API call for maximum efficiency:
{
"payload": {
"url": "https://target-company.com",
"max_pages": 15,
"extract_team": true,
"extract_company_info": true,
"extract_pain_points": true,
"product_description": "HR automation platform",
"icp_description": "Companies with 100-1000 employees",
"custom_ai_prompt": {
"enabled": true,
"system_prompt": "You are a competitive intelligence analyst.",
"user_prompt": "Extract pricing information, key differentiators, and main competitors mentioned.",
"output_field_name": "competitive_intel",
"model": "google/gemini-2.0-flash-exp:free",
"temperature": 0.2,
"max_tokens": 6000
}
}
}
All extracted in ONE API call:
- Team members
- Company info
- Pain points
- Lead scoring (CHAMP)
- Custom competitive intel
Example 6: Minimal Compendium for LLM Context
Optimize for RAG/LLM applications with minimal token usage:
{
"payload": {
"url": "https://content-heavy-site.com",
"max_pages": 30,
"compendium": {
"enabled": true,
"cleanup_level": "minimal",
"max_chars": 50000,
"remove_duplicates": true
}
}
}
Cleanup levels:
| Level | Size | Best For |
|---|---|---|
| `raw` | 100% | Full fidelity, archival |
| `fit` | ~60% | General purpose (default) |
| `citations` | ~35% | Academic format with sources |
| `minimal` | ~15% | LLM consumption, token savings |
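If you are budgeting context-window tokens, the size percentages above give a rough way to pick a level. A back-of-the-envelope sketch; the ~4-chars-per-token ratio and the percentages are approximations, not API guarantees:

```python
# Approximate sizes relative to "raw", from the table above.
LEVEL_RATIO = {"raw": 1.00, "fit": 0.60, "citations": 0.35, "minimal": 0.15}
CHARS_PER_TOKEN = 4  # common rule of thumb for English text

def estimate_tokens(raw_chars: int, level: str) -> int:
    """Estimate token count of a compendium at a given cleanup level."""
    return int(raw_chars * LEVEL_RATIO[level] / CHARS_PER_TOKEN)

# A 200k-char site at "minimal" fits comfortably in a small context window:
print(estimate_tokens(200_000, "minimal"))  # 7500
print(estimate_tokens(200_000, "fit"))      # 30000
```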
Example 7: SPA-Heavy Site
For React/Vue/Angular sites that need JavaScript rendering:
{
"payload": {
"url": "https://react-dashboard.app",
"max_pages": 10,
"enable_spa": true,
"spa_timeout": 60,
"extract_company_info": true
}
}
SPA detection is automatic by default. Increase `spa_timeout` for slow-loading sites.
Response Structure
{
"success": true,
"job_id": "uuid",
"type": "spiderSite",
"status": "completed",
"processing_time_seconds": 25.4,
"data": {
"url": "https://example.com",
"pages_crawled": 10,
"crawl_status": "success",
"emails": ["contact@example.com", "sales@example.com"],
"phones": ["+1-555-123-4567"],
"addresses": ["123 Main St, SF, CA"],
"linkedin": "https://linkedin.com/company/example",
"twitter": "https://twitter.com/example",
"facebook": null,
"instagram": null,
"youtube": null,
"tiktok": null,
"github": "https://github.com/example",
"pinterest": null,
"snapchat": null,
"reddit": null,
"medium": null,
"discord": null,
"whatsapp": null,
"telegram": null,
"company_vitals": {
"one_sentence_summary": "...",
"key_services": ["Service A", "Service B"],
"target_audience": "...",
"industry": "B2B SaaS"
},
"team_members": [
{
"name": "John Doe",
"title": "CEO",
"email": "john@example.com",
"linkedin": "https://linkedin.com/in/johndoe"
}
],
"pain_points": {
"inferred_challenges": ["Challenge 1", "Challenge 2"],
"recent_mentions": ["News item 1"]
},
"lead_scoring": {
"icp_fit_grade": "A",
"engagement_score": 85,
"lead_priority": "Hot",
"champ_breakdown": {
"challenges": "...",
"authority": "...",
"money": "...",
"prioritization": "..."
}
},
"custom_analysis": {
"your_custom_fields": "..."
},
"markdown_compendium": "# Company Name\n\n...",
"compendium": {
"available": true,
"storage_location": "inline",
"size_chars": 45000,
"cleanup_level": "fit"
},
"metadata": {
"crawl_strategy": "sitemap",
"spa_enabled": true,
"browser_rendering_available": true
}
}
}
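Most of the 14 social media fields come back `null`; a small helper (a sketch, not part of any SDK) can collapse them into just the profiles that were found:

```python
# All social fields present in a spiderSite result, per the structure above.
SOCIAL_FIELDS = [
    "linkedin", "twitter", "facebook", "instagram", "youtube", "tiktok",
    "github", "pinterest", "snapchat", "reddit", "medium", "discord",
    "whatsapp", "telegram",
]

def found_socials(data: dict) -> dict:
    """Return only the social profiles that are non-null in a result."""
    return {f: data[f] for f in SOCIAL_FIELDS if data.get(f)}

sample = {"linkedin": "https://linkedin.com/company/example",
          "github": "https://github.com/example", "twitter": None}
print(found_socials(sample))
# {'linkedin': 'https://linkedin.com/company/example', 'github': 'https://github.com/example'}
```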
Compendium Storage (SpiderMedia v2.14.0)
Compendiums are uploaded to your dedicated SpiderMedia bucket with permanent public URLs:
{
"data": {
"markdown_compendium": "# Company Name\n\n...",
"compendium": {
"available": true,
"storage_location": "spidermedia",
"download_url": "https://media.spideriq.ai/client-xxx/compendiums/job-uuid.md",
"filename": "compendiums/job-uuid.md",
"size_bytes": 45000,
"content_hash": "abc123..."
}
}
}
Permanent URLs: SpiderMedia URLs never expire. No more 24-hour download windows!
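A result can therefore carry its compendium either inline or as a SpiderMedia URL. A minimal sketch for handling both cases (the `content_hash` check is omitted because the hash algorithm isn't specified here):

```python
import requests

def get_compendium(data: dict) -> str:
    """Return the markdown compendium, downloading from SpiderMedia if needed."""
    comp = data.get("compendium") or {}
    if not comp.get("available"):
        return ""
    if comp.get("storage_location") == "spidermedia":
        # Permanent public URL: safe to fetch lazily, any time after the job.
        resp = requests.get(comp["download_url"], timeout=30)
        resp.raise_for_status()
        return resp.text
    # "inline": the markdown is already in the response body.
    return data.get("markdown_compendium", "")
```

Because SpiderMedia URLs never expire, you can also store just `download_url` and defer the fetch until the compendium is actually needed.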
Complete Workflow Example
Here's a complete workflow from submission to result retrieval:
import requests
import time

# Configuration
API_BASE = "https://spideriq.ai/api/v1"
CLIENT_TOKEN = "<your_client_id>:<your_api_key>:<your_api_secret>"
headers = {"Authorization": f"Bearer {CLIENT_TOKEN}"}

# Step 1: Submit job
submit_data = {
    "payload": {
        "url": "https://target-company.com",
        "max_pages": 10,
        "extract_company_info": True,
        "extract_team": True
    }
}

response = requests.post(
    f"{API_BASE}/jobs/spiderSite/submit",
    headers=headers,
    json=submit_data
)
job_id = response.json()['job_id']
print(f"✓ Job submitted: {job_id}")

# Step 2: Poll for completion
max_wait = 120  # 2 minutes
start_time = time.time()

while time.time() - start_time < max_wait:
    response = requests.get(
        f"{API_BASE}/jobs/{job_id}/results",
        headers=headers
    )
    result = response.json()

    if result['status'] == 'completed':
        print("✓ Job completed!")
        data = result['data']

        # Access extracted data
        print(f"\nEmails: {data['emails']}")
        print(f"Phones: {data['phones']}")
        print(f"LinkedIn: {data['linkedin']}")

        if data.get('company_vitals'):
            print(f"\nCompany: {data['company_vitals']['one_sentence_summary']}")

        if data.get('team_members'):
            print(f"\nTeam Members: {len(data['team_members'])}")
            for member in data['team_members']:
                print(f"  - {member['name']}: {member.get('title', 'N/A')}")
        break
    elif result['status'] == 'failed':
        print(f"✗ Job failed: {result.get('error_message')}")
        break
    else:
        print(f"⏳ Status: {result['status']}...")
        time.sleep(3)
else:
    print("✗ Timeout waiting for job to complete")
Best Practices
When to use AI features
Use AI features when:
- Qualifying high-value leads
- Building targeted outreach campaigns
- Identifying decision makers
- Scoring leads by ICP fit
Skip AI features for:
- Bulk contact extraction
- Budget-sensitive scraping
- When you only need contact info
Optimizing crawl strategy
- `bestfirst` (default): best for most use cases; intelligent prioritization
- Sitemap-first (automatic): used automatically when sitemap.xml is discovered
- `bfs`: when you need broad coverage across sections
- `dfs`: when you need deep coverage of specific sections
Choosing cleanup level
| Level | Use Case |
|---|---|
| `raw` | Academic research, legal compliance |
| `fit` | General purpose (default) |
| `citations` | Research documents with sources |
| `minimal` | LLM/RAG applications |
Custom AI prompt tips
- Be specific: clearly define what data you want extracted
- Use `json_schema`: helps the AI return structured data
- Set `output_field_name`: organizes multiple custom analyses
- Adjust `temperature`: lower (0.1) for factual extraction, higher (0.5+) for creative analysis
Error Handling
URL Not Accessible
Error: "Failed to connect to target URL"
Causes:
- Invalid URL
- Site blocking bots
- Site requires authentication
Solutions:
- Verify URL is correct and publicly accessible
- Check if site blocks automated access
Timeout
Error: "Page load timeout exceeded"
Causes:
- Slow-loading site
- Heavy JavaScript rendering
Solutions:
- Increase `timeout` parameter (max 120s)
- Increase `spa_timeout` for SPA sites
- Reduce `max_pages`
Rate Limit Exceeded
Error: "Rate limit exceeded"
Solutions:
- Implement delays between requests
- Use exponential backoff
- Contact support for higher limits
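Exponential backoff can be sketched as below; the retry trigger (HTTP 429) and the delay parameters are illustrative assumptions, not documented API behavior:

```python
import time
import requests

def backoff_delays(max_retries: int = 5, base: float = 1.0) -> list:
    """Delay before each retry attempt: base, 2*base, 4*base, ..."""
    return [base * (2 ** n) for n in range(max_retries)]

def submit_with_backoff(url: str, headers: dict, body: dict) -> dict:
    """POST a job, backing off exponentially while rate limited (HTTP 429)."""
    for delay in backoff_delays():
        resp = requests.post(url, headers=headers, json=body, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        time.sleep(delay)
    raise RuntimeError("Still rate limited after all retries")
```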
Limitations
Authentication: SpiderSite cannot scrape pages requiring login
CAPTCHAs: Sites with CAPTCHA protection cannot be scraped
robots.txt: SpiderSite respects robots.txt directives