Submit SpiderSite Job
POST /api/v1/jobs/spiderSite/submit

Overview
Submit a SpiderSite job to crawl websites with contact extraction, AI-powered company analysis, team member identification, and CHAMP lead scoring.
Version 2.14.0: Compendiums now stored in SpiderMedia (per-client storage) with permanent public URLs. No more 24-hour expiration!
Version 2.10.0: Custom AI Prompts for tailored analysis, AI Context Engine with markdown compendiums, SPA auto-detection, sitemap-first crawling, and multilingual support (36+ languages).
Key Features

- Smart Crawling: Sitemap-first with intelligent page prioritization
- Contact Extraction: Emails, phones, addresses, 14 social platforms
- AI Analysis: Company vitals, team members, pain points
- Lead Scoring: CHAMP framework with ICP fit scoring
- Custom AI Prompts: Your own prompts for tailored analysis (v2.10.0)
- SpiderMedia Storage: Compendiums stored with permanent public URLs (v2.14.0)
Request Body

- `payload` (object, required): Job configuration payload
- `priority` (integer, default: 0): Job priority from 0 to 10; higher values are processed first
Response

- `job_id` (string): Unique job identifier (UUID format)
- `type` (string): Always `spiderSite` for this endpoint
- `status` (string): Initial job status (always `queued`)
- `created_at` (string): Job creation timestamp (ISO 8601)
- `from_cache` (boolean): Whether the result was served from the deduplication cache (24-hour TTL)
- `message` (string): Confirmation message
Request Examples
- Minimal
- With AI Features
- Full CHAMP Scoring
- Compendium Minimal
- Compendium Disabled
- SPA Site
- Multilingual
- High Priority
- Partial Compendium
- Full Configuration
- Custom AI Analysis
- Combined AI + Custom
Most basic request - only URL (contact extraction only, no AI):
- cURL
- Python
- JavaScript
```bash
curl -X POST https://spideriq.ai/api/v1/jobs/spiderSite/submit \
  -H "Authorization: Bearer <your_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "url": "https://example.com"
    }
  }'
```
```python
import requests

response = requests.post(
    "https://spideriq.ai/api/v1/jobs/spiderSite/submit",
    headers={"Authorization": "Bearer <your_token>"},
    json={
        "payload": {
            "url": "https://example.com"
        }
    }
)
job = response.json()
print(f"Job ID: {job['job_id']}")
```
```javascript
const response = await fetch(
  'https://spideriq.ai/api/v1/jobs/spiderSite/submit',
  {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer <your_token>',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      payload: {
        url: 'https://example.com'
      }
    })
  }
);
const job = await response.json();
console.log('Job ID:', job.job_id);
```
What you get:
- Contact info (emails, phones, addresses)
- 14 social media platforms
- Markdown compendium (fit level)
- No AI tokens used (0 cost)
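After submission, the job can be polled until it finishes. Below is a minimal polling sketch in Python; the status-endpoint path (`GET /api/v1/jobs/{job_id}`) and the terminal status names are assumptions for illustration, so check the job-status reference for the exact contract:

```python
import time

def poll_job(fetch_status, interval=2.0, timeout=120.0,
             terminal=("completed", "failed")):
    """Call fetch_status() until the job reaches a terminal status.

    fetch_status: zero-argument callable returning the job dict, e.g.
    a wrapper around GET /api/v1/jobs/{job_id} (assumed path).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status()
        if job.get("status") in terminal:
            return job
        time.sleep(interval)
    raise TimeoutError("job did not finish within the timeout")

# Example with a stubbed status sequence instead of real HTTP:
statuses = iter(["queued", "processing", "completed"])
result = poll_job(lambda: {"status": next(statuses)}, interval=0.0)
print(result["status"])  # completed
```

Passing the fetcher in as a callable keeps the retry logic testable without network access.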
Extract company info and team members:
```json
{
  "payload": {
    "url": "https://techstart.com",
    "max_pages": 15,
    "extract_company_info": true,
    "extract_team": true,
    "extract_pain_points": true
  },
  "priority": 5
}
```
What you get:
- All contact info
- Company vitals (name, summary, industry, services, target audience)
- Team members (names, titles, emails, LinkedIn)
- Pain points analysis
- Markdown compendium
- AI tokens: ~1,500 tokens total
Complete lead scoring with CHAMP framework:
```json
{
  "payload": {
    "url": "https://techstart.com",
    "max_pages": 20,
    "extract_company_info": true,
    "extract_team": true,
    "extract_pain_points": true,
    "product_description": "AI-powered customer support automation platform that reduces ticket resolution time by 60% using intelligent routing and automated responses.",
    "icp_description": "B2B SaaS companies with 50-500 employees, experiencing rapid growth, struggling with support team scalability, budget >$50k/year for support tools."
  },
  "priority": 8
}
```
What you get:
- All contact info + company info + team + pain points
- CHAMP Analysis:
- Challenges: Specific pain points matched to your solution
- Authority: Decision makers and buying process
- Money: Budget indicators and funding status
- Prioritization: Urgency signals and priority level
- ICP Fit Score: 0-1 score indicating how well they match your ICP
- Personalization hooks for outreach
- AI tokens: ~3,000 tokens total
Aggressive token optimization (70% savings):
```json
{
  "payload": {
    "url": "https://example.com",
    "max_pages": 10,
    "compendium": {
      "enabled": true,
      "cleanup_level": "minimal",
      "max_chars": 50000,
      "remove_duplicates": true
    }
  }
}
```
Use case: Feeding content to LLMs with limited context windows
Contact extraction only, no markdown:
```json
{
  "payload": {
    "url": "https://example.com",
    "compendium": {
      "enabled": false
    }
  }
}
```
Use case: When you only need contact info, not content
JavaScript-heavy site (React/Vue/Angular):
```json
{
  "payload": {
    "url": "https://modern-spa.com",
    "max_pages": 10,
    "enable_spa": true,
    "spa_timeout": 45
  }
}
```
Auto-detection: SPA rendering is enabled automatically when a JavaScript-heavy site is detected
German website with localized target pages:
```json
{
  "payload": {
    "url": "https://deutsche-firma.de",
    "max_pages": 15,
    "target_pages": ["kontakt", "über-uns", "team", "news"],
    "extract_company_info": true
  }
}
```
Supported languages: 36+ European languages automatically detected
Urgent job processing:
```json
{
  "payload": {
    "url": "https://urgent-lead.com",
    "max_pages": 10,
    "extract_company_info": true,
    "extract_team": true
  },
  "priority": 10
}
```
Priority range: 0-10 (higher = processed first)
Override only specific compendium settings:
```json
{
  "payload": {
    "url": "https://example.com",
    "compendium": {
      "cleanup_level": "citations",
      "max_chars": 200000
    }
  }
}
```
Merge behavior: Partial configs are merged with the defaults (enabled: true, remove_duplicates: true, cleanup_level: "fit", and so on), so you only need to specify the fields you want to override.
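The merge can be pictured as a shallow dict update over the defaults. A sketch; the `enabled`, `remove_duplicates`, and `cleanup_level` defaults are documented here, while `max_chars` is an illustrative assumption:

```python
COMPENDIUM_DEFAULTS = {
    "enabled": True,
    "cleanup_level": "fit",   # documented default
    "remove_duplicates": True,
    "max_chars": 500000,      # assumed default, for illustration only
}

def merge_compendium(partial):
    """Merge a partial compendium config over the defaults (shallow merge)."""
    return {**COMPENDIUM_DEFAULTS, **partial}

merged = merge_compendium({"cleanup_level": "citations", "max_chars": 200000})
print(merged["enabled"])        # True (inherited from defaults)
print(merged["cleanup_level"])  # citations (overridden)
```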
All parameters specified:
```json
{
  "payload": {
    "url": "https://enterprise-site.com",
    "max_pages": 50,
    "crawl_strategy": "bestfirst",
    "target_pages": ["contact", "about", "team", "leadership", "news", "blog", "careers"],
    "enable_spa": true,
    "spa_timeout": 60,
    "timeout": 45,
    "extract_team": true,
    "extract_company_info": true,
    "extract_pain_points": true,
    "product_description": "Enterprise analytics platform for Fortune 500 companies...",
    "icp_description": "Enterprises with >1000 employees, data-driven culture, budget >$500k/year...",
    "compendium": {
      "enabled": true,
      "max_chars": 500000,
      "cleanup_level": "fit",
      "separator": "\n\n---\n\n",
      "include_in_response": true,
      "remove_duplicates": true,
      "priority_sections": ["main", "article", "content", "section"]
    }
  },
  "priority": 10
}
```
Use case: Enterprise-level lead generation with maximum data extraction
Extract specific information using your own prompts (v2.10.0):
```json
{
  "payload": {
    "url": "https://saas-company.com",
    "max_pages": 10,
    "custom_ai_prompt": {
      "enabled": true,
      "system_prompt": "You are a cybersecurity analyst specializing in SaaS platforms.",
      "user_prompt": "Extract all security certifications, compliance frameworks, and data privacy practices mentioned on this website.",
      "json_schema": {
        "security_certifications": ["SOC 2", "ISO 27001"],
        "compliance_frameworks": ["GDPR", "HIPAA"],
        "data_privacy_summary": "string"
      },
      "model": "google/gemini-2.0-flash-exp:free",
      "temperature": 0.1,
      "max_tokens": 4000
    }
  }
}
```
Response includes:
```json
{
  "data": {
    "custom_analysis": {
      "security_certifications": ["SOC 2 Type II", "ISO 27001"],
      "compliance_frameworks": ["GDPR", "CCPA", "HIPAA"],
      "data_privacy_summary": "Company maintains strict data encryption..."
    }
  }
}
```
Use cases:
- Security/compliance audits
- Competitive intelligence
- Industry-specific data extraction
- Technical stack analysis
All AI features in ONE efficient API call (v2.10.0):
```json
{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 15,
    "extract_team": true,
    "extract_company_info": true,
    "extract_pain_points": true,
    "product_description": "HR automation platform",
    "icp_description": "Companies with 100-1000 employees",
    "custom_ai_prompt": {
      "enabled": true,
      "system_prompt": "You are a competitive intelligence analyst.",
      "user_prompt": "Extract pricing information, key differentiators, and main competitors mentioned.",
      "output_field_name": "competitive_intel",
      "model": "google/gemini-2.0-flash-exp:free",
      "temperature": 0.2,
      "max_tokens": 6000
    }
  }
}
```
All extracted in ONE API call:
- Contact info (emails, phones, social)
- Team members
- Company info
- Pain points
- Lead scoring (CHAMP)
- Custom competitive intelligence
Why this is efficient: All AI features combine into a single API request, reducing latency and cost.
Example Response

```json
{
  "job_id": "974ceeda-84fe-4634-bdcd-adc895c6bc75",
  "type": "spiderSite",
  "status": "queued",
  "created_at": "2025-10-27T14:30:00Z",
  "from_cache": false,
  "message": "SpiderSite job queued successfully. Estimated processing time: 15-30 seconds."
}
```
From Cache (Deduplication)

If the same URL was crawled in the last 24 hours:

```json
{
  "job_id": "abc12345-6789-4def-ghij-klmnopqrstuv",
  "type": "spiderSite",
  "status": "completed",
  "created_at": "2025-10-27T14:30:00Z",
  "from_cache": true,
  "message": "Job results retrieved from cache (original job: 974ceeda-84fe-4634-bdcd-adc895c6bc75)"
}
```
Deduplication: Identical URLs crawled within 24 hours return cached results instantly (Redis cache with 24hr TTL).
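The deduplication behavior can be illustrated with a simple TTL cache keyed by URL. This is a sketch of the idea only, not the server's actual Redis implementation:

```python
import time

TTL_SECONDS = 24 * 3600  # documented 24-hour TTL

class DedupCache:
    """Toy 24-hour TTL cache keyed by URL, mirroring the documented behavior."""
    def __init__(self, now=time.time):
        self._now = now
        self._store = {}  # url -> (timestamp, job dict)

    def get(self, url):
        entry = self._store.get(url)
        if entry and self._now() - entry[0] < TTL_SECONDS:
            return {**entry[1], "from_cache": True}
        return None  # miss or expired: the server would run a fresh crawl

    def put(self, url, job):
        self._store[url] = (self._now(), job)

clock = [0.0]  # injectable clock so expiry can be simulated
cache = DedupCache(now=lambda: clock[0])
cache.put("https://example.com", {"job_id": "abc", "from_cache": False})
print(cache.get("https://example.com")["from_cache"])  # True
clock[0] += TTL_SECONDS + 1
print(cache.get("https://example.com"))                # None (expired)
```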
AI Token Costs
AI features are opt-in. By default, no AI tokens are used (0 cost). Enable only the features you need.
| Feature | AI Tokens | What You Get |
|---|---|---|
| Base crawl (no AI) | 0 tokens | Contact info + compendium |
| `extract_company_info` | ~500 tokens | Company vitals (name, summary, industry, services, target audience) |
| `extract_team` | ~500 tokens | Team members with names, titles, emails, LinkedIn |
| `extract_pain_points` | ~500 tokens | Business challenges inferred from content |
| CHAMP scoring | +1,500 tokens | Full CHAMP analysis + ICP fit score + personalization hooks |
| Total (all features) | ~3,000 tokens | Complete lead profile |
Cost optimization: Start with basic crawl (0 tokens). Enable AI features only for high-value leads.
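The table above translates into a simple per-job estimate. The ~500 and ~1,500 figures are the documented approximations, not guarantees:

```python
def estimate_ai_tokens(company_info=False, team=False, pain_points=False,
                       champ=False):
    """Rough AI token estimate per job, using the documented approximations."""
    tokens = 0
    tokens += 500 if company_info else 0   # extract_company_info
    tokens += 500 if team else 0           # extract_team
    tokens += 500 if pain_points else 0    # extract_pain_points
    tokens += 1500 if champ else 0         # CHAMP scoring adds ~1,500 tokens
    return tokens

print(estimate_ai_tokens())                              # 0 (base crawl)
print(estimate_ai_tokens(True, True, True, champ=True))  # 3000 (all features)
```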
Processing Time
| Scenario | Estimated Time |
|---|---|
| Simple site (5-10 pages) | 5-15 seconds |
| Medium site (10-20 pages) | 15-30 seconds |
| Large site (20-50 pages) | 30-60 seconds |
| SPA site (JavaScript-heavy) | +10-20 seconds |
| With AI extraction | +5-10 seconds |
| Full CHAMP analysis | 20-60 seconds total |
Best Practices
When to use AI features
Use AI features when:
- Qualifying high-value leads
- Building targeted outreach campaigns
- Identifying decision makers
- Scoring leads by ICP fit
Skip AI features for:
- Bulk contact extraction
- Budget-sensitive scraping
- When you only need contact info
Choosing cleanup level
raw (100%): Academic research, legal compliance, full fidelity needed
fit (60%): General purpose, balances quality and size (default)
citations (70%): Academic papers, research documents with sources
minimal (30%): LLM consumption, token optimization, main content only
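The percentages above can be read as the approximate share of the raw extracted text that each level retains. A small helper to estimate output size for a given level; treat the fractions as rough guides, not exact ratios:

```python
# Approximate retained share per cleanup level, from the list above
RETAINED_SHARE = {"raw": 1.00, "citations": 0.70, "fit": 0.60, "minimal": 0.30}

def estimated_chars(raw_chars, cleanup_level="fit"):
    """Estimate compendium size after cleanup, given raw extracted chars."""
    return round(raw_chars * RETAINED_SHARE[cleanup_level])

print(estimated_chars(100_000, "minimal"))  # 30000
```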
Optimizing crawl strategy
bestfirst: Best for most use cases - intelligent prioritization
Sitemap-first (auto): Used automatically when sitemap.xml discovered
bfs: When you need broad coverage across sections
dfs: When you need deep coverage of specific sections
SPA detection tips
Auto-detection works for:
- React, Vue, Angular apps
- Dynamically loaded content
- Infinite scroll sites
Increase spa_timeout if:
- Site loads slowly (>30s)
- Content loads after initial render
- You see incomplete data
Set enable_spa: false if:
- Site is static HTML (faster processing)
- You're getting timeout errors unnecessarily
Common Use Cases

1. Basic Lead Generation (0 AI Tokens)

Extract contact info from company websites:

```json
{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 10
  }
}
```
Returns: Emails, phones, addresses, social media, markdown compendium
2. Qualified Lead Scoring (CHAMP)

Full analysis for high-value prospects:

```json
{
  "payload": {
    "url": "https://qualified-lead.com",
    "max_pages": 20,
    "extract_company_info": true,
    "extract_team": true,
    "extract_pain_points": true,
    "product_description": "Your product here...",
    "icp_description": "Your ICP here..."
  }
}
```
Returns: Full CHAMP analysis, ICP fit score, personalization hooks
3. Team Member Identification

Find decision makers and contacts:

```json
{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 15,
    "target_pages": ["team", "about", "leadership", "management"],
    "extract_team": true
  }
}
```
Returns: Team members with names, titles, emails, LinkedIn
4. Competitor Analysis

Understand company positioning and offerings:

```json
{
  "payload": {
    "url": "https://competitor.com",
    "max_pages": 25,
    "extract_company_info": true,
    "extract_pain_points": true,
    "compendium": {
      "cleanup_level": "citations",
      "max_chars": 200000
    }
  }
}
```
Returns: Company summary, services, target audience, pain points, detailed content
5. Custom AI Analysis (v2.10.0)

Extract industry-specific or custom data:

```json
{
  "payload": {
    "url": "https://fintech-company.com",
    "max_pages": 15,
    "custom_ai_prompt": {
      "enabled": true,
      "system_prompt": "You are a fintech industry analyst.",
      "user_prompt": "Extract: 1) Regulatory licenses held, 2) Banking partners mentioned, 3) Funding history, 4) Key product features",
      "output_field_name": "fintech_analysis"
    }
  }
}
```
Returns: Custom structured data in data.fintech_analysis
Compendium Storage (SpiderMedia)
v2.14.0: Compendiums are now stored in your client's SpiderMedia bucket with permanent public URLs. No more 24-hour expiration!
Compendiums are uploaded to your dedicated SpiderMedia storage (per-client SeaweedFS bucket). The download_url is a permanent public URL:
```json
{
  "data": {
    "markdown_compendium": "# Company Name\n\nContent here...",
    "compendium": {
      "available": true,
      "chars": 45000,
      "cleanup_level": "fit",
      "storage_location": "spidermedia",
      "download_url": "https://media.spideriq.ai/client-xxx/compendiums/job-uuid.md",
      "filename": "compendiums/job-uuid.md",
      "size_bytes": 45000,
      "content_hash": "abc123def456...",
      "estimated_tokens": 11000
    }
  }
}
```
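The `estimated_tokens` field tracks a common rough heuristic of about four characters per token. A sketch of that estimate; the exact divisor the API uses is not documented, so this is an approximation:

```python
def rough_token_estimate(chars, chars_per_token=4):
    """Approximate LLM token count from character count (~4 chars per token)."""
    return chars // chars_per_token

print(rough_token_estimate(45000))  # 11250, close to the reported 11000
```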
Storage Behavior
| Scenario | Behavior |
|---|---|
| SpiderMedia configured | Uploaded to client bucket, permanent URL |
| SpiderMedia not configured | Inline in response (legacy fallback) |
| Upload fails | Inline in response with error logged |
Permanent URLs: Unlike the old R2 presigned URLs (24-hour expiry), SpiderMedia URLs never expire. You can store and reference them indefinitely.
Limitations

- Authentication: SpiderSite cannot scrape pages requiring login/authentication
- CAPTCHAs: Sites with CAPTCHA protection cannot be scraped
- Rate Limits: 100 requests per minute per API key
- robots.txt: SpiderSite respects robots.txt directives
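To stay under the 100 requests/minute limit when submitting jobs in bulk, throttle client-side. A minimal sliding-window sketch; the limit value comes from the list above, while the implementation itself is illustrative:

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most `limit` calls per `window` seconds."""
    def __init__(self, limit=100, window=60.0, now=time.monotonic):
        self.limit, self.window, self._now = limit, window, now
        self._calls = deque()  # timestamps of recent allowed calls

    def allow(self):
        now = self._now()
        # Drop timestamps that have aged out of the window
        while self._calls and now - self._calls[0] >= self.window:
            self._calls.popleft()
        if len(self._calls) < self.limit:
            self._calls.append(now)
            return True
        return False

clock = [0.0]  # injectable clock for deterministic testing
rl = RateLimiter(limit=100, window=60.0, now=lambda: clock[0])
allowed = sum(rl.allow() for _ in range(150))
print(allowed)  # 100 allowed within the same minute, 50 rejected
clock[0] += 60.0
print(rl.allow())  # True again once the window has passed
```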