
Submit SpiderSite Job

POST /api/v1/jobs/spiderSite/submit

Overview

Submit a SpiderSite job to crawl websites with contact extraction, AI-powered company analysis, team member identification, and CHAMP lead scoring.

info

Version 2.14.0: Compendiums now stored in SpiderMedia (per-client storage) with permanent public URLs. No more 24-hour expiration!

info

Version 2.10.0: Custom AI Prompts for tailored analysis, AI Context Engine with markdown compendiums, SPA auto-detection, sitemap-first crawling, and multilingual support (36+ languages).

Key Features

  • 🕸️ Smart Crawling: Sitemap-first with intelligent page prioritization
  • 📇 Contact Extraction: Emails, phones, addresses, 14 social platforms
  • 🧠 AI Analysis: Company vitals, team members, pain points
  • ⭐ Lead Scoring: CHAMP framework with ICP fit scoring
  • ✨ Custom AI Prompts: Your own prompts for tailored analysis (v2.10.0)
  • ☁️ SpiderMedia Storage: Compendiums stored with permanent public URLs (v2.14.0)


Request Body

payload (object, required)

Job configuration payload

priority (integer, default: 0)

Job priority (0-10, higher = processed first)


Response

job_id (string)

Unique job identifier (UUID format)

type (string)

Always spiderSite for this endpoint

status (string)

Initial job status (always queued)

created_at (string)

Job creation timestamp (ISO 8601)

from_cache (boolean)

Whether this job was deduplicated from cache (24-hour TTL)

message (string)

Confirmation message
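For typed clients, the response fields above can be modeled as a small structure. This is an illustrative sketch based on the field list, not an official SDK type:

```python
from typing import TypedDict

class SubmitResponse(TypedDict):
    """Shape of the submit response, per the field list above (illustrative only)."""
    job_id: str       # UUID of the queued job
    type: str         # always "spiderSite" for this endpoint
    status: str       # "queued" on fresh submits, "completed" on cache hits
    created_at: str   # ISO 8601 timestamp
    from_cache: bool  # True when deduplicated from the 24-hour cache
    message: str      # human-readable confirmation

resp: SubmitResponse = {
    "job_id": "974ceeda-84fe-4634-bdcd-adc895c6bc75",
    "type": "spiderSite",
    "status": "queued",
    "created_at": "2025-10-27T14:30:00Z",
    "from_cache": False,
    "message": "SpiderSite job queued successfully.",
}
print(resp["status"])  # → queued
```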


Request Examples

The most basic request passes only a URL (contact extraction only, no AI):

    curl -X POST https://spideriq.ai/api/v1/jobs/spiderSite/submit \
      -H "Authorization: Bearer <your_token>" \
      -H "Content-Type: application/json" \
      -d '{
        "payload": {
          "url": "https://example.com"
        }
      }'

What you get:

  • Contact info (emails, phones, addresses)
  • 14 social media platforms
  • Markdown compendium (fit level)
  • No AI tokens used (0 cost)
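The same request can be built from Python with only the standard library. A minimal sketch, using the endpoint and headers from the curl example above:

```python
import json
import urllib.request

API_URL = "https://spideriq.ai/api/v1/jobs/spiderSite/submit"

def build_submit_request(url: str, token: str) -> urllib.request.Request:
    """Build the POST request for a basic SpiderSite job (contact extraction only)."""
    body = json.dumps({"payload": {"url": url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_submit_request("https://example.com", "<your_token>")
# urllib.request.urlopen(req) would perform the live submit; omitted here.
print(json.loads(req.data)["payload"]["url"])  # → https://example.com
```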

Example Response

{
  "job_id": "974ceeda-84fe-4634-bdcd-adc895c6bc75",
  "type": "spiderSite",
  "status": "queued",
  "created_at": "2025-10-27T14:30:00Z",
  "from_cache": false,
  "message": "SpiderSite job queued successfully. Estimated processing time: 15-30 seconds."
}

From Cache (Deduplication)

If the same URL was crawled in the last 24 hours:

{
  "job_id": "abc12345-6789-4def-ghij-klmnopqrstuv",
  "type": "spiderSite",
  "status": "completed",
  "created_at": "2025-10-27T14:30:00Z",
  "from_cache": true,
  "message": "Job results retrieved from cache (original job: 974ceeda-84fe-4634-bdcd-adc895c6bc75)"
}

info

Deduplication: Identical URLs crawled within 24 hours return cached results instantly (Redis cache with 24-hour TTL).


AI Token Costs

warning

AI features are opt-in. By default, no AI tokens are used (0 cost). Enable only the features you need.

| Feature | AI Tokens | What You Get |
|---|---|---|
| Base crawl (no AI) | 0 tokens | Contact info + compendium |
| extract_company_info | ~500 tokens | Company vitals (name, summary, industry, services, target audience) |
| extract_team | ~500 tokens | Team members with names, titles, emails, LinkedIn |
| extract_pain_points | ~500 tokens | Business challenges inferred from content |
| CHAMP scoring | +1,500 tokens | Full CHAMP analysis + ICP fit score + personalization hooks |
| Total (all features) | ~3,000 tokens | Complete lead profile |

tip

Cost optimization: Start with basic crawl (0 tokens). Enable AI features only for high-value leads.
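The per-feature costs above make budgeting easy to sketch. Feature names mirror the request flags; the +1,500-token CHAMP step is modeled as its own switch, and all figures are the approximations from the table:

```python
# Approximate per-feature AI token costs, from the table above.
FEATURE_COSTS = {
    "extract_company_info": 500,
    "extract_team": 500,
    "extract_pain_points": 500,
    "champ_scoring": 1500,  # full CHAMP analysis on top of the extractions
}

def estimate_tokens(**features: bool) -> int:
    """Rough AI token estimate for one job; a base crawl costs 0 tokens."""
    return sum(cost for name, cost in FEATURE_COSTS.items() if features.get(name))

print(estimate_tokens())  # → 0 (base crawl, no AI)
print(estimate_tokens(extract_team=True))  # → 500
print(estimate_tokens(extract_company_info=True, extract_team=True,
                      extract_pain_points=True, champ_scoring=True))  # → 3000
```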


Processing Time

| Scenario | Estimated Time |
|---|---|
| Simple site (5-10 pages) | 5-15 seconds |
| Medium site (10-20 pages) | 15-30 seconds |
| Large site (20-50 pages) | 30-60 seconds |
| SPA site (JavaScript-heavy) | +10-20 seconds |
| With AI extraction | +5-10 seconds |
| Full CHAMP analysis | 20-60 seconds total |

Best Practices

When to use AI features

Use AI features when:

  • Qualifying high-value leads
  • Building targeted outreach campaigns
  • Identifying decision makers
  • Scoring leads by ICP fit

Skip AI features for:

  • Bulk contact extraction
  • Budget-sensitive scraping
  • Jobs where you only need contact info

Choosing cleanup level

raw (100%): Academic research, legal compliance, full fidelity needed

fit (60%): General purpose, balances quality and size (default)

citations (70%): Academic papers, research documents with sources

minimal (30%): LLM consumption, token optimization, main content only
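For instance, a token-optimized crawl intended for LLM consumption might request the minimal level. This payload mirrors the compendium object used in the Competitor Analysis example later on this page; treat it as a sketch:

{
  "payload": {
    "url": "https://example.com",
    "compendium": {
      "cleanup_level": "minimal"
    }
  }
}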

Optimizing crawl strategy

bestfirst: Best for most use cases - intelligent prioritization

Sitemap-first (auto): Used automatically when sitemap.xml discovered

bfs: When you need broad coverage across sections

dfs: When you need deep coverage of specific sections

SPA detection tips

Auto-detection works for:

  • React, Vue, Angular apps
  • Dynamically loaded content
  • Infinite scroll sites

Increase spa_timeout if:

  • Site loads slowly (>30s)
  • Content loads after initial render
  • You see incomplete data

Set enable_spa: false if:

  • Site is static HTML (faster processing)
  • You're getting timeout errors unnecessarily
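Putting the tips above together, a slow-loading SPA might be crawled with an explicit timeout. The enable_spa and spa_timeout names follow the options referenced above; the timeout unit (seconds) is an assumption here:

{
  "payload": {
    "url": "https://spa-heavy-site.com",
    "enable_spa": true,
    "spa_timeout": 45
  }
}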

Common Use Cases

1. Basic Lead Generation (0 AI Tokens)

Extract contact info from company websites:

{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 10
  }
}

Returns: Emails, phones, addresses, social media, markdown compendium


2. Qualified Lead Scoring (CHAMP)

Full analysis for high-value prospects:

{
  "payload": {
    "url": "https://qualified-lead.com",
    "max_pages": 20,
    "extract_company_info": true,
    "extract_team": true,
    "extract_pain_points": true,
    "product_description": "Your product here...",
    "icp_description": "Your ICP here..."
  }
}

Returns: Full CHAMP analysis, ICP fit score, personalization hooks


3. Team Member Identification

Find decision makers and contacts:

{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 15,
    "target_pages": ["team", "about", "leadership", "management"],
    "extract_team": true
  }
}

Returns: Team members with names, titles, emails, LinkedIn


4. Competitor Analysis

Understand company positioning and offerings:

{
  "payload": {
    "url": "https://competitor.com",
    "max_pages": 25,
    "extract_company_info": true,
    "extract_pain_points": true,
    "compendium": {
      "cleanup_level": "citations",
      "max_chars": 200000
    }
  }
}

Returns: Company summary, services, target audience, pain points, detailed content


5. Custom AI Analysis (v2.10.0)

Extract industry-specific or custom data:

{
  "payload": {
    "url": "https://fintech-company.com",
    "max_pages": 15,
    "custom_ai_prompt": {
      "enabled": true,
      "system_prompt": "You are a fintech industry analyst.",
      "user_prompt": "Extract: 1) Regulatory licenses held, 2) Banking partners mentioned, 3) Funding history, 4) Key product features",
      "output_field_name": "fintech_analysis"
    }
  }
}

Returns: Custom structured data in data.fintech_analysis


Compendium Storage (SpiderMedia)

info

v2.14.0: Compendiums are now stored in your client's SpiderMedia bucket with permanent public URLs. No more 24-hour expiration!

Compendiums are uploaded to your dedicated SpiderMedia storage (per-client SeaweedFS bucket). The download_url is a permanent public URL:

{
  "data": {
    "markdown_compendium": "# Company Name\n\nContent here...",
    "compendium": {
      "available": true,
      "chars": 45000,
      "cleanup_level": "fit",
      "storage_location": "spidermedia",
      "download_url": "https://media.spideriq.ai/client-xxx/compendiums/job-uuid.md",
      "filename": "compendiums/job-uuid.md",
      "size_bytes": 45000,
      "content_hash": "abc123def456...",
      "estimated_tokens": 11000
    }
  }
}

Storage Behavior

| Scenario | Behavior |
|---|---|
| SpiderMedia configured | Uploaded to client bucket, permanent URL |
| SpiderMedia not configured | Inline in response (legacy fallback) |
| Upload fails | Inline in response with error logged |

tip

Permanent URLs: Unlike the old R2 presigned URLs (24-hour expiry), SpiderMedia URLs never expire. You can store and reference them indefinitely.


Limitations

warning

Authentication: SpiderSite cannot scrape pages requiring login/authentication

warning

CAPTCHAs: Sites with CAPTCHA protection cannot be scraped

warning

Rate Limits: 100 requests per minute per API key

info

robots.txt: SpiderSite respects robots.txt directives
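To stay under the 100 requests/minute limit when submitting in bulk, a simple client-side throttle helps. This is an illustrative sliding-window sketch, not part of the API:

```python
import time
from collections import deque

class RateLimiter:
    """Blocks so that at most `limit` calls happen per `window` seconds."""

    def __init__(self, limit: int = 100, window: float = 60.0):
        self.limit, self.window = limit, window
        self.calls: deque = deque()  # monotonic timestamps of recent calls

    def wait(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.limit:
            # Sleep until the oldest call falls out of the window.
            time.sleep(self.window - (now - self.calls[0]))
            self.calls.popleft()
        self.calls.append(time.monotonic())

limiter = RateLimiter(limit=100, window=60.0)
limiter.wait()  # call before each submit; sleeps only when the budget is spent
```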


Next Steps