Firecrawl Integration
Overview
Firecrawl is a robust tool that transforms websites into LLM-ready data through its Crawler and Scraper functionalities.
- Crawler: Automatically extracts data from websites by crawling web pages and following links. It recursively traverses sites (starting from a URL, using sitemaps when available), handles dynamic content rendered with JavaScript, and supports sync or async modes with webhook notifications.
- Scraper: Extracts targeted content from specific web pages using customizable rules. It supports single-URL or batch scraping, main-content-only extraction, tag inclusion/exclusion, TLS verification options, device emulation, and adjustable page-load timing.
Whether you need to map full site structures or extract specific data from pages, Firecrawl provides a seamless and customizable solution.
Firecrawl nodes can now be used directly inside flows in both sync and async modes. You no longer need to create a separate flow for crawling or scraping.
Features
Key Functionalities (Crawler + Scraper)
Crawler
- Comprehensive Crawling: Recursively traverses websites, identifies and accesses subpages, uses sitemaps when available, and follows links for thorough data collection.
- Dynamic Content Handling: Manages JavaScript-rendered content for full extraction from accessible subpages.
- Sync & Async Modes: Run crawls synchronously or asynchronously with webhook callbacks for completion, page, started, and failed events.
Scraper
- Customizable URL Scraping: Specify exact URLs to scrape for targeted data extraction.
- Selective Content: Scrape only main content or include/exclude specific HTML tags.
- TLS Verification: Option to skip TLS verification for broader website compatibility.
- Device Emulation: Emulate mobile devices for mobile-optimized pages.
- Adjustable Load Timing: Set wait time (ms) for dynamic content to load.
Shared
- Agent Workflows: Async Agent, Sync Agent, and Check Agent for intelligent web data extraction.
- Webhooks: Real-time updates for crawl/scrape completion and events.
Benefits
- Reliability: Handles proxies, rate limits, and anti-scraping measures for consistent extraction (Crawler).
- Efficiency: Manages requests to minimize bandwidth and avoid detection (Crawler).
- Precision: Targeted scraping with tag and scope control (Scraper).
- Flexibility: Mobile emulation and TLS options for diverse sites (Scraper).
- Structured Output: Generates structured, LLM-compatible data; customizable inclusion/exclusion; handles static and dynamic content.
Prerequisites
Before using Firecrawl, ensure the following:
- A valid Firecrawl API key.
- Access to the Firecrawl service host URL.
- Properly configured credentials for Firecrawl.
- A webhook endpoint for receiving notifications (required for the crawler).
For self-hosting: if the connection fails, whitelist the Cloudflare IP ranges listed at https://www.cloudflare.com/ips/.
Setup
Step 1: Obtain API Credentials
- Register on Firecrawl.
- Generate an API key from your account dashboard.
- Note the Host URL and Webhook Endpoint.
Step 2: Configure Firecrawl Credentials
Use the following format to set up your credentials:
| Key Name | Description | Example Value |
|---|---|---|
| Credential Name | Name to identify this set of credentials | my-firecrawl-creds |
| Firecrawl API Key | Authentication key for accessing Firecrawl services | fc_api_xxxxxxxxxxxxx |
| Host | Base URL where Firecrawl service is hosted | https://api.firecrawl.dev |
Configuration Reference
Sync Mode Output Format
Batched Mode

```json
{
"success": true,
"status": "completed",
"completed": 48,
"total": 50,
"creditsUsed": 13,
"expiresAt": "2025-08-01T12:30:00.000Z",
"data": [
{
"url": "https://example.com/page-1",
"content": "Lorem ipsum dolor sit amet...",
"metadata": {
"title": "Page 1 Title",
"description": "This is a sample description.",
"language": "en"
}
},
{
"url": "https://example.com/page-2",
"content": "Second page scraped content...",
"metadata": {
"title": "Page 2 Title",
"description": "Another sample description.",
"language": "en"
}
}
// ... more pages
]
}
```

Single Mode

```json
{
"success": true,
"status": "completed",
"completed": 1,
"total": 2,
"creditsUsed": 1,
"expiresAt": "2025-08-02T12:30:00.000Z",
"data": [
{
"url": "https://example.com/page-1",
"content": "Lorem ipsum dolor sit amet...",
"metadata": {
"title": "Page 1 Title",
"description": "This is a sample description.",
"language": "en"
}
}
]
}
```

Async Mode Output Format

```json
{
"success": true,
"id": "8***************************7",
"url": "https://api.firecrawl.dev/v1/crawl/8***************************7"
}
```

Step 3: Set Up the Mode (Crawler)
For the Crawler, choose sync or async mode. In async mode, configure a webhook endpoint (for example, a Webhook flow/URL in Lamatic) to receive crawl updates and results. Async mode adds the following options:
| Parameter | Description | Example Value |
|---|---|---|
| Callback Webhook | URL to receive notifications about crawl completion | https://example.com/webhook |
| Webhook Headers | Headers sent to the webhook | {"Content-Type": "application/json"} |
| Webhook Metadata | Metadata sent to the webhook | {"status": "{{codeNode_540.status}}"} |
| Webhook Events | Events to send: completed, failed, page, started | ["completed", "failed", "page", "started"] |
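For reference, the fragment below sketches how these async options could appear inside a crawler node's `values` block in low-code YAML (a complete node appears under Low-Code Examples). The `webhookHeaders` and `webhookMetadata` key names are assumptions and may differ in the flow editor.

```yaml
# Sketch of the webhook-related values on an async crawler node.
# `webhookHeaders` and `webhookMetadata` are assumed key names.
values:
  crawlMode: async
  webhook: https://example.com/webhook
  webhookHeaders: '{"Content-Type": "application/json"}'
  webhookMetadata: '{"status": "{{codeNode_540.status}}"}'
  webhookEvents: [completed, failed, page, started]
```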
Crawler Configuration (Single)
| Parameter | Description | Example Value |
|---|---|---|
| Credential Name | Select previously saved credentials | my-firecrawl-creds |
| URL | Starting point URL for the crawler | https://example.com |
| Exclude Path | URL patterns to exclude from the crawl | "admin/*", "private/*" |
| Include Path | URL patterns to include in the crawl | "blog/*", "products/*" |
| Crawl Depth | Maximum depth to crawl relative to the entered URL | 3 |
| Crawl Limit | Maximum number of pages to crawl | 1000 |
| Crawl Sub Pages | Toggle to enable or disable crawling sub pages | true |
| Max Discovery Depth | Max depth for discovering new URLs during the crawl | 5 |
| Ignore Sitemap | Ignore the sitemap.xml file for crawling | false |
| Allow Backward Links | Allow crawling backward links (e.g., blog post → homepage) | true |
| Allow External Links | Allow crawling external links (e.g., links to other domains) | false |
| Ignore Query Parameters | Ignore specific query parameters in URLs | false |
| Delay | Delay between requests to avoid overloading server (in seconds) | 2 |
Batch Crawler Configuration (Async / Sync)
| Parameter | Description | Example Value |
|---|---|---|
| Credential Name | Select previously saved credentials | my-firecrawl-creds |
| URL List | List of starting URLs to crawl | [ "https://x.com", "https://y.com" ] |
| Include Path | Paths to include during crawl | "blog/*" |
| Exclude Path | Paths to exclude during crawl | "admin/*" |
| Crawl Depth | Depth to crawl for each URL | 3 |
| Crawl Limit | Max pages per domain | 500 |
| Max Discovery Depth | How far discovered links can go | 4 |
| Allow External Links | Whether to crawl external domains | false |
| Allow Backward Links | Whether to revisit previous pages | true |
| Crawl Sub Pages | Enable sub-page traversal | true |
| Ignore Sitemap | Skip sitemap.xml | false |
| Delay | Throttle request delay in seconds | 1 |
| Callback Webhook | URL to receive notifications about crawl completion | https://example.com/webhook |
| Webhook Headers | Headers to be sent to the webhook | {"Content-Type": "application/json"} |
| Webhook Metadata | Metadata to be sent to the webhook | {"status": "{{codeNode_540.status}}"} |
| Webhook Events | A multiselect list of events to be sent to the webhook | ["completed", "failed", "page", "started"] |
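The Low-Code Examples section below covers only the single-URL nodes, so the following is a hedged sketch of how the batch crawler parameters above might map onto a node definition. The `batchCrawlerNode` type and the `urlList` key are assumed names; confirm the exact identifiers in the flow editor.

```yaml
# Illustrative batch crawler node; nodeType and key names are assumptions.
nodes:
  - nodeId: batchCrawlerNode_101
    nodeType: batchCrawlerNode
    nodeName: Batch Crawler
    values:
      credentials: my-firecrawl-creds
      urlList: ["https://x.com", "https://y.com"]
      includePath: ["blog/*"]
      excludePath: ["admin/*"]
      crawlDepth: 3
      crawlLimit: 500
      maxDiscoveryDepth: 4
      allowExternalLinks: false
      allowBackwardLinks: true
      crawlSubPages: true
      ignoreSitemap: false
      delay: 1
      webhook: https://example.com/webhook
      webhookEvents: [completed, failed, page, started]
    needs:
      - triggerNode_1
```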
Scraper Configuration (Single)
| Parameter | Description | Example Value |
|---|---|---|
| Credential Name | Select previously saved credentials | my-firecrawl-creds |
| URL | Target URL to scrape | https://example.com/page |
| Main Content | Extract only main content (exclude header/footer/nav) | true |
| Skip TLS Verification | Bypass SSL certificate validation | false |
| Include Tags | HTML tags to include in extraction | p, h1, h2, article |
| Exclude Tags | HTML tags to exclude from extraction | nav, footer, aside |
| Emulate Mobile Device | Simulate mobile browser access | true |
| Wait for Page Load | Time to wait for dynamic content (in ms) | 123 |
Batch Scraper Configuration (Async)
| Parameter | Description | Example Value |
|---|---|---|
| Credential Name | Select previously saved credentials | my-firecrawl-creds |
| URL List | List of URLs to scrape in batch | [ "https://a.com", "https://b.com" ] |
| Main Content | Extract only main content from each page | true |
| Skip TLS Verification | Ignore SSL certificate errors | false |
| Include Tags | HTML tags to extract | p, h1, h2 |
| Exclude Tags | HTML tags to exclude from extraction | aside, footer |
| Emulate Mobile Device | Use mobile browser viewport | true |
| Wait for Page Load | Delay for dynamic content to appear (in ms) | 200 |
| Callback Webhook | URL to receive notifications about batch scrape completion | https://example.com/webhook |
| Webhook Headers | Headers to be sent to the webhook | {"Content-Type": "application/json"} |
| Webhook Metadata | Metadata to be sent to the webhook | {"status": "{{codeNode_540.status}}"} |
| Webhook Events | A multiselect list of events to be sent to the webhook | ["completed", "failed", "page", "started"] |
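Similarly, here is a hedged sketch of a batch scraper node based on the parameters above. The `batchScraperNode` type and `urlList` key are assumed names; the remaining keys mirror the single scraper node shown under Low-Code Examples.

```yaml
# Illustrative batch scraper node; nodeType and key names are assumptions.
nodes:
  - nodeId: batchScraperNode_201
    nodeType: batchScraperNode
    nodeName: Batch Scraper
    values:
      credentials: my-firecrawl-creds
      urlList: ["https://a.com", "https://b.com"]
      onlyMainContent: true
      skipTLsVerification: false
      includeTags: [p, h1, h2]
      excludeTags: [aside, footer]
      mobile: true
      waitFor: 200
      webhook: https://example.com/webhook
      webhookEvents: [completed, failed, page, started]
    needs:
      - triggerNode_1
```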
Low-Code Examples
Crawler Node

```yaml
nodes:
  - nodeId: crawlerNode_880
    nodeType: crawlerNode
    nodeName: Crawler
    values:
      credentials: my-firecrawl-creds
      url: https://example.com
      crawlMode: async
      webhook: https://example.com/webhook
      webhookEvents: [completed, failed, page, started]
      crawlSubPages: true
      crawlLimit: 10
      crawlDepth: 1
      excludePath: []
      includePath: []
      maxDiscoveryDepth: 1
      ignoreSitemap: false
      allowBackwardLinks: false
      allowExternalLinks: false
      delay: 0
    needs:
      - triggerNode_1
```

Scraper Node
```yaml
nodes:
  - nodeId: scraperNode_680
    nodeType: scraperNode
    nodeName: Scraper
    values:
      credentials: my-firecrawl-creds
      url: https://example.com/page
      onlyMainContent: false
      skipTLsVerification: false
      mobile: false
      waitFor: 123
      includeTags: []
      excludeTags: []
    needs:
      - triggerNode_1
```

Crawler Event Output (Async Mode)
When using async crawler mode, the node can emit these events to your webhook:
- Started: When the crawl begins (job ID, URL).
- Page: For each crawled page (URL, extracted data).
- Completed: When the crawl finishes (summary, page count, errors).
- Failed: When the crawl fails (error message, job ID).
Example – Started:

```json
{
"success": true,
"type": "crawl.started",
"id": "82***********************4",
"data": [],
"metadata": {}
}
```

Scraper Output Schema
- markdown: Scraped content as Markdown.
- language: Detected language of the content.
- referrer: Referrer URL, if applicable.
- title: Title of the scraped page.
- scrapeId: Unique identifier for the scrape.
- sourceURL: URL from which content was scraped.
- url: Resolved URL of the resource.
- statusCode: HTTP status code of the scrape.
Example:

```json
{
"markdown": "# Page content...",
"language": "en",
"referrer": "",
"title": "Example Page",
"scrapeId": "uuid-here",
"sourceURL": "https://example.com/page",
"url": "https://example.com/page",
"statusCode": 200
}
```

Map URL Configuration
| Parameter | Description | Example Value |
|---|---|---|
| Credential Name | Select previously saved credentials | my-firecrawl-creds |
| URL | Starting URL to map the structure | https://example.com |
| Main Content | Extract only main content from each page | true |
| Skip TLS Verification | Ignore SSL certificate errors | false |
| Include Tags | HTML tags to extract | p, h1, h2 |
| Exclude Tags | HTML tags to exclude from extraction | aside, footer |
| Emulate Mobile Device | Use mobile browser viewport | true |
| Wait for Page Load | Delay for dynamic content to appear (in ms) | 200 |
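The Map URL node is not covered under Low-Code Examples, so the sketch below simply maps the table above onto a node definition. The `mapUrlNode` type is an assumed name; the value keys mirror the scraper node.

```yaml
# Illustrative Map URL node; nodeType is an assumed name.
nodes:
  - nodeId: mapUrlNode_301
    nodeType: mapUrlNode
    nodeName: Map URL
    values:
      credentials: my-firecrawl-creds
      url: https://example.com
      onlyMainContent: true
      skipTLsVerification: false
      includeTags: [p, h1, h2]
      excludeTags: [aside, footer]
      mobile: true
      waitFor: 200
    needs:
      - triggerNode_1
```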
Map URL Output Example

```json
{
"success": true,
"links": [
"https://lamatic.ai/docs",
"https://lamatic.ai/docs/architecture",
"https://lamatic.ai/docs/career",
"https://lamatic.ai/docs/context",
"https://lamatic.ai/docs/context/vectordb",
"https://lamatic.ai/docs/context/vectordb/adding-data",
"https://lamatic.ai/docs/contributing",
"https://lamatic.ai/docs/demo",
"https://lamatic.ai/docs/deployments",
"https://lamatic.ai/docs/deployments/cache"
]
}
```

Agent Workflow Configuration
Firecrawl now supports agent workflows with three execution modes for submitting and monitoring agent tasks.
Agent Modes
- Async Agent: Submit agent tasks asynchronously and receive a task ID for later status checking.
- Sync Agent: Execute agent tasks synchronously and receive results immediately upon completion.
- Check Agent: Check the status of a previously submitted async agent task.
Agent Configuration (Async / Sync)
| Parameter | Description | Example Value |
|---|---|---|
| Credential Name | Select previously saved credentials | my-firecrawl-creds |
| Prompt | The prompt or instruction for the agent to execute | "Extract all product prices from the page" |
| URLs | Optional list of URLs to limit the agent’s scope. You can provide this as an array of URLs or as a comma-separated list. | [ "https://example.com/products" ] |
| Schema | JSON schema defining the expected output structure | {"price": "string", "name": "string"} |
| Max Credit | Maximum credits to use for the agent task | 100 |
| Strict URL Constraints | Enforce strict URL matching rules | true |
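As a rough guide, an agent node configuration could look like the sketch below. The `firecrawlAgentNode` type and the value keys (`prompt`, `urls`, `schema`, `maxCredit`, `strictUrlConstraints`) are illustrative names derived from the table above, not confirmed identifiers.

```yaml
# Illustrative sync agent node; nodeType and key names are assumptions.
nodes:
  - nodeId: agentNode_401
    nodeType: firecrawlAgentNode
    nodeName: Sync Agent
    values:
      credentials: my-firecrawl-creds
      prompt: "Extract all product prices from the page"
      urls: ["https://example.com/products"]
      schema: '{"price": "string", "name": "string"}'
      maxCredit: 100
      strictUrlConstraints: true
    needs:
      - triggerNode_1
```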
Check Agent Configuration
| Parameter | Description | Example Value |
|---|---|---|
| Credential Name | Select previously saved credentials | my-firecrawl-creds |
| Agent Job ID | The task ID returned from an Async Agent submission | "8***************************7" |
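A Check Agent step typically consumes the task ID returned by a previous Async Agent submission. The sketch below is illustrative only; the `firecrawlCheckAgentNode` type, the `agentJobId` key, and the templated reference to the async agent's output are all assumed.

```yaml
# Illustrative check-agent node; nodeType, key names, and the templated
# job-ID reference are assumptions.
nodes:
  - nodeId: checkAgentNode_402
    nodeType: firecrawlCheckAgentNode
    nodeName: Check Agent
    values:
      credentials: my-firecrawl-creds
      agentJobId: "{{agentNode_401.output.id}}"
    needs:
      - agentNode_401
```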
Troubleshooting
Common Issues
| Problem | Solution |
|---|---|
| Invalid API Key | Ensure the API key is correct and has not expired. |
| Connection Issues | Verify that the host URL is correct and reachable. |
| Webhook Errors | Check if the webhook endpoint is active and correctly configured. |
| Crawling Errors | Review the inclusion/exclusion paths for accuracy. |
| Dynamic Content Not Loaded | Increase the Wait for Page Load time in the configuration. |
Debugging
- Check Firecrawl logs for detailed error information.
- Test the webhook endpoint to confirm it is receiving updates.
- If the connection fails, whitelist the Cloudflare IP ranges listed at https://www.cloudflare.com/ips/.