# Instagram Profile Scraping

## Overview

SpiderPublicInstagram allows you to extract public profile data from Instagram without requiring login credentials. This is useful for:

- **Lead Enrichment**: Add Instagram presence and contact info to existing leads
- **Influencer Research**: Build databases with verified follower counts and engagement metrics
- **Contact Discovery**: Extract business emails and phone numbers from profiles
- **Brand Monitoring**: Track competitor Instagram presence

### No Login Required

SpiderPublicInstagram uses Instagram's public web API endpoint. It does not require Instagram login credentials, making it safe and compliant for public data extraction.
## Quick Start

### 1. Submit a Profile Scraping Job

```bash
curl -X POST "https://spideriq.ai/api/v1/jobs/spiderPublicInstagram/submit" \
  -H "Authorization: Bearer $CLIENT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "username": "natgeo"
    }
  }'
```

### 2. Check Job Status

```bash
curl "https://spideriq.ai/api/v1/jobs/{job_id}/status" \
  -H "Authorization: Bearer $CLIENT_TOKEN"
```

### 3. Get Results

```bash
curl "https://spideriq.ai/api/v1/jobs/{job_id}/results" \
  -H "Authorization: Bearer $CLIENT_TOKEN"
```
## Input Formats

SpiderPublicInstagram accepts several input formats:

**Username only**

```json
{
  "payload": {
    "username": "natgeo"
  }
}
```

**Full URL**

```json
{
  "payload": {
    "instagram_url": "https://instagram.com/natgeo"
  }
}
```

**URL with parameters**

```json
{
  "payload": {
    "instagram_url": "https://www.instagram.com/natgeo/?igsh=abc123"
  }
}
```

**With @ symbol**

```json
{
  "payload": {
    "username": "@natgeo"
  }
}
```
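All four formats resolve to the same profile. The worker normalizes input server-side, but if you want to deduplicate inputs before submitting jobs, a minimal client-side sketch (the helper name is ours, not part of the API):

```python
from urllib.parse import urlparse

def normalize_username(value):
    """Reduce any accepted input format to a bare Instagram username."""
    value = value.strip()
    if value.startswith(("http://", "https://")):
        # Full URL, possibly with query parameters: take the first path segment
        path = urlparse(value).path
        value = path.strip("/").split("/")[0]
    # Strip a leading @ if present
    return value.lstrip("@")

print(normalize_username("@natgeo"))                                        # natgeo
print(normalize_username("https://www.instagram.com/natgeo/?igsh=abc123"))  # natgeo
```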
## What Data Can You Extract?

### Profile Information

| Field | Description | Always Available |
|---|---|---|
| `username` | Instagram handle | Yes |
| `full_name` | Display name | Yes |
| `bio` | Profile biography | Public profiles only |
| `external_url` | Website link | If configured |
| `profile_pic_url` | Profile image URL | Yes |

### Engagement Metrics

| Field | Description |
|---|---|
| `follower_count` | Number of followers |
| `following_count` | Number following |
| `post_count` | Total posts |

### Account Type Flags

| Field | Description |
|---|---|
| `is_business_account` | Business account |
| `is_professional_account` | Creator/professional |
| `is_verified` | Blue checkmark |
| `is_private` | Private profile |

### Business Information (Business Accounts Only)

| Field | Description |
|---|---|
| `business_category` | Category (e.g., "Restaurant") |
| `business_email` | Contact email |
| `business_phone` | Contact phone |

### Extracted Contacts

| Field | Description |
|---|---|
| `bio_emails` | Emails found in bio text |
| `bio_phones` | Phone numbers found in bio text |
## Contact Extraction

SpiderPublicInstagram extracts contact information from two sources:

### 1. Business Profile Settings

Business accounts can configure contact information in their profile settings. This appears as:

- `business_email`: Official contact email
- `business_phone`: Official contact phone

### 2. Bio Text Parsing

Many users include contact information directly in their bio text. SpiderPublicInstagram uses regex patterns to extract:

- **Email addresses**: Standard email format detection
- **Phone numbers**: US format, international format, and raw digits

Example bio:

```
Contact us: hello@company.com | +1 (555) 123-4567
```

Extracted:

```json
{
  "bio_emails": ["hello@company.com"],
  "bio_phones": ["+1 (555) 123-4567"]
}
```
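The worker's exact patterns are internal; a minimal sketch of the bio-parsing approach, using illustrative regexes of our own, looks like this:

```python
import re

# Illustrative patterns only — not the worker's actual internal regexes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d{1,3}[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

def extract_contacts(bio):
    """Pull email addresses and phone-like strings out of free-form bio text."""
    return {
        "bio_emails": EMAIL_RE.findall(bio),
        "bio_phones": PHONE_RE.findall(bio),
    }

print(extract_contacts("Contact us: hello@company.com | +1 (555) 123-4567"))
# {'bio_emails': ['hello@company.com'], 'bio_phones': ['+1 (555) 123-4567']}
```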
### Enable Contact Extraction

Contact extraction is enabled by default. To disable it, set `extract_contact_from_bio: false` in your payload.
## Profile Image Hosting

Instagram CDN URLs can expire. SpiderPublicInstagram can upload profile images to SpiderMedia for permanent hosting:

```json
{
  "payload": {
    "username": "natgeo",
    "store_profile_image": true
  }
}
```

The response includes both URLs:

```json
{
  "profile_pic_url": "https://scontent-xxx.cdninstagram.com/...",
  "profile_pic_url_hosted": "https://media.spideriq.ai/client-xxx/instagram_profile_natgeo.jpg"
}
```

| URL Type | Pros | Cons |
|---|---|---|
| `profile_pic_url` | Original quality | May expire |
| `profile_pic_url_hosted` | Permanent, fast CDN | Stored in your quota |
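In application code, a simple fallback keeps display working whether or not hosting was enabled for the job (a sketch; the helper name is ours):

```python
def display_image_url(profile):
    """Prefer the permanent SpiderMedia copy; fall back to the Instagram CDN URL."""
    return profile.get("profile_pic_url_hosted") or profile.get("profile_pic_url")

profile = {
    "profile_pic_url": "https://scontent-xxx.cdninstagram.com/...",
    "profile_pic_url_hosted": "https://media.spideriq.ai/client-xxx/instagram_profile_natgeo.jpg",
}
print(display_image_url(profile))
# https://media.spideriq.ai/client-xxx/instagram_profile_natgeo.jpg
```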
## Batch Processing

To process multiple profiles, submit jobs in a loop:

```python
import time

import requests

profiles = ["natgeo", "nasa", "google", "microsoft"]
job_ids = []

for username in profiles:
    response = requests.post(
        "https://spideriq.ai/api/v1/jobs/spiderPublicInstagram/submit",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json={"payload": {"username": username}}
    )
    job_ids.append(response.json()["job_id"])
    time.sleep(1)  # Small delay between submissions

print(f"Submitted {len(job_ids)} jobs")
```
## Combining with Other Workers

### Instagram → SpiderSite Pipeline

Extract Instagram data, then scrape the linked website:

```python
# Step 1: Get the Instagram profile
instagram_job = submit_instagram_job("company_handle")
instagram_data = wait_for_results(instagram_job["job_id"])

# Step 2: Scrape the linked website
if instagram_data.get("external_url"):
    website_job = submit_spidersite_job(instagram_data["external_url"])
    website_data = wait_for_results(website_job["job_id"])
```
### Campaign Workflow Integration

SpiderPublicInstagram results can be enriched alongside SpiderMaps campaigns:

1. Run a SpiderMaps campaign to discover businesses
2. Extract Instagram URLs from the business data
3. Submit SpiderPublicInstagram jobs for each Instagram profile
4. Merge the results for comprehensive lead data
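The merge step can be sketched as a pure function. Field names for the Instagram side follow the result tables in this guide; the SpiderMaps business record shape is illustrative:

```python
def merge_instagram_data(business, ig):
    """Merge SpiderPublicInstagram results into a SpiderMaps business record."""
    enriched = dict(business)
    enriched.update({
        "instagram_username": ig.get("username"),
        "instagram_followers": ig.get("follower_count"),
        "instagram_verified": ig.get("is_verified"),
    })
    # Collect and deduplicate any contact info the worker extracted
    emails = [ig.get("business_email"), *ig.get("bio_emails", [])]
    enriched["emails"] = sorted({e for e in emails if e})
    return enriched

business = {"name": "Acme Cafe"}
ig = {
    "username": "acmecafe",
    "follower_count": 1200,
    "is_verified": False,
    "business_email": "hi@acme.com",
    "bio_emails": ["hi@acme.com", "sales@acme.com"],
}
print(merge_instagram_data(business, ig))
```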
## Rate Limits and Best Practices

### Instagram Rate Limits

| Limit | Value |
|---|---|
| Requests per hour per IP | ~200 |
| Built-in delay | 3-10 seconds |

### Best Practices

#### Use Mobile Proxies

Instagram blocks datacenter IPs quickly. SpiderProxy mobile proxies are automatically assigned for production jobs, providing carrier-grade IP addresses.

#### Respect Rate Limits

Don't submit more than 100-200 jobs per hour. The worker includes built-in delays, but submitting too many jobs can still trigger blocks.

#### Handle Private Profiles

Private profiles return limited data. Check for `is_private: true` in results and handle it accordingly in your application.
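A small sketch of that check. Which fields survive for private accounts is an assumption here, based on the Profile Information table marking `bio` as public-profiles-only:

```python
def summarize_profile(result):
    """Keep only the fields that are reliable for the account's privacy level."""
    if result.get("is_private"):
        # Bio text and extracted contacts are unavailable for private accounts
        return {
            "username": result["username"],
            "private": True,
            "follower_count": result.get("follower_count"),
        }
    return {
        "username": result["username"],
        "private": False,
        "follower_count": result.get("follower_count"),
        "bio": result.get("bio"),
        "bio_emails": result.get("bio_emails", []),
    }
```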
#### Use Hosted Images

Always use `profile_pic_url_hosted` for display in your application. Instagram CDN URLs can expire or be blocked.
## Error Handling

### Common Errors
| Error | Cause | Solution |
|---|---|---|
| Profile not found | Username doesn't exist | Verify username is correct |
| Rate limited | Too many requests | Wait and retry later |
| IP blocked | Datacenter IP detected | Use mobile proxy (automatic in production) |
### Retry Strategy

For rate-limit errors, implement exponential backoff:

```python
import time

def get_instagram_profile(username, max_retries=3):
    for attempt in range(max_retries):
        job = submit_job(username)
        result = wait_for_results(job["job_id"])
        if result.get("success"):
            return result["data"]
        if "rate limit" in result.get("error", "").lower():
            wait_time = (2 ** attempt) * 60  # 1, 2, 4 minutes
            time.sleep(wait_time)
        else:
            raise Exception(result.get("error"))
    raise Exception("Max retries exceeded")
```
## Example: Lead Enrichment

A complete example enriching leads with Instagram data:

```python
import time

import requests

API_BASE = "https://spideriq.ai/api/v1"
TOKEN = "your_token"

def enrich_lead_with_instagram(lead):
    """Add Instagram data to a lead record."""
    instagram_handle = lead.get("instagram")
    if not instagram_handle:
        return lead

    # Submit job
    response = requests.post(
        f"{API_BASE}/jobs/spiderPublicInstagram/submit",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"payload": {"username": instagram_handle}}
    )
    job_id = response.json()["job_id"]

    # Wait for results (with polling)
    for _ in range(30):  # Max 30 attempts
        status = requests.get(
            f"{API_BASE}/jobs/{job_id}/status",
            headers={"Authorization": f"Bearer {TOKEN}"}
        ).json()

        if status["status"] == "completed":
            results = requests.get(
                f"{API_BASE}/jobs/{job_id}/results",
                headers={"Authorization": f"Bearer {TOKEN}"}
            ).json()

            # Enrich lead with Instagram data
            lead["instagram_followers"] = results["data"]["follower_count"]
            lead["instagram_verified"] = results["data"]["is_verified"]
            lead["instagram_bio"] = results["data"]["bio"]

            # Add any discovered contacts
            if results["data"].get("business_email"):
                lead.setdefault("emails", []).append(results["data"]["business_email"])
            if results["data"].get("bio_emails"):
                lead.setdefault("emails", []).extend(results["data"]["bio_emails"])
            break
        elif status["status"] == "failed":
            lead["instagram_error"] = status.get("error")
            break

        time.sleep(2)

    return lead

# Usage
lead = {
    "company": "National Geographic",
    "instagram": "natgeo"
}

enriched_lead = enrich_lead_with_instagram(lead)
print(enriched_lead)
```