BubblaV

Understanding the Crawling Process

Learn how BubblaV's smart crawler analyzes your website content using advanced AI to build your chatbot's knowledge base. Understand each stage of the process and what to expect.

Crawling Process Overview

When you add a website, our smart crawler running on Railway.com infrastructure automatically begins analyzing your site using Playwright browser automation to extract and process content.

Smart Crawler Service

FastAPI service running on Railway.com with Playwright for JavaScript-heavy sites

Content Processing

Content chunked and stored in PostgreSQL with pgvector embeddings for AI responses
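As an illustration, chunking extracted page text for embedding might look like the sketch below. The function name, chunk size, and overlap are placeholder values for illustration, not BubblaV's actual settings:

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Split extracted page text into overlapping chunks for embedding.

    max_chars and overlap are illustrative defaults; overlap keeps context
    that spans a chunk boundary retrievable from both neighbors.
    """
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

Each chunk would then be written to a pending-chunks queue and embedded individually, so the chatbot can retrieve only the passages relevant to a question.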

Website Crawling Status in Dashboard

Website Status Flow

1. Initial Status: Ready

   Website created and validated, ready for crawling to begin.

2. Status: Crawling

   Smart crawler clears old data, analyzes site structure, extracts content, and generates text chunks.

   Technical: website_content and pending_chunks tables are cleared, then new content is extracted via Playwright.

3. Status: Embedding

   Text chunks are processed into vector embeddings using the Fireworks.ai nomic-embed-text-v1.5 model.

   Technical: The embedding worker processes pending_chunks into vector representations for AI search.

4. Status: Ready

   Chatbot is trained and ready to answer questions about your website content.

   Result: All chunks embedded; the chatbot can provide intelligent responses based on your content.
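The lifecycle above can be modeled as a small state machine. This is a sketch based on the statuses documented here, not BubblaV's actual implementation; any in-flight stage can also fail, and a failed website can be re-crawled:

```python
from enum import Enum

class WebsiteStatus(str, Enum):
    READY = "ready"
    CRAWLING = "crawling"
    EMBEDDING = "embedding"
    FAILED = "failed"

# Allowed transitions in the crawl lifecycle described above.
TRANSITIONS = {
    WebsiteStatus.READY: {WebsiteStatus.CRAWLING},
    WebsiteStatus.CRAWLING: {WebsiteStatus.EMBEDDING, WebsiteStatus.FAILED},
    WebsiteStatus.EMBEDDING: {WebsiteStatus.READY, WebsiteStatus.FAILED},
    WebsiteStatus.FAILED: {WebsiteStatus.CRAWLING},  # retry via Reprocess
}

def can_transition(current: WebsiteStatus, nxt: WebsiteStatus) -> bool:
    """Return True if the status change is a valid step in the lifecycle."""
    return nxt in TRANSITIONS.get(current, set())
```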

Timing Expectations

Small Websites: 2-5 min
Under 50 pages, simple structure, mostly static content

Medium Websites: 5-15 min
50-500 pages, some JavaScript, moderate complexity

Large Websites: 15+ min
500+ pages, complex JavaScript, e-commerce sites

Real-time Progress Monitoring

Dashboard Tracking

  • Website Status: Live updates via Supabase realtime
  • Chunk Progress: total_chunks_generated counter
  • Embedding Progress: chunks_embedded counter
  • Crawl Logs: Detailed activity logs in crawl_logs table
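The two counters above are enough to derive a progress percentage for the dashboard. A minimal sketch (the helper name is ours; only the counter names come from the schema above):

```python
def embedding_progress(total_chunks_generated: int, chunks_embedded: int) -> float:
    """Percentage of chunks embedded, guarding against division by zero
    while the crawl is still generating chunks."""
    if total_chunks_generated <= 0:
        return 0.0
    return round(100 * chunks_embedded / total_chunks_generated, 1)
```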

Background Services

  • Smart Crawler: https://smart-crawler-production.up.railway.app
  • Embedding Worker: Processes chunks in background
  • Janitor Service: Monitors stalled jobs every hour
  • Authentication: SERVICE_AUTH_TOKEN for security

When Things Go Wrong

Status: Failed

Common causes include:

  • Website not publicly accessible (returns a non-200 HTTP status)
  • URL validation failed during the validateUrl() check
  • Smart crawler service unavailable or timed out
  • Duplicate crawl job prevented (website already processing)
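The structural part of a URL check like validateUrl() can be sketched as below; the real check also confirms the site is publicly reachable with a 200 response, which is omitted here to keep the example self-contained:

```python
from urllib.parse import urlparse

def validate_url_format(url: str) -> bool:
    """Structural validation only (sketch): require an http(s) scheme and a host.
    A full check would also issue a request and require a 200 status."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```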

Recovery Options

  • Reprocess Button: Triggers new crawl via /api/crawl/secure
  • Status Reset: Clears existing content and starts fresh
  • Conflict Protection: Prevents multiple simultaneous crawls
  • Janitor Service: Automatically cleans up stalled jobs
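Conflict protection and the status reset fit together naturally in the reprocess path. A hypothetical handler (function name and dict shape are ours, not BubblaV's API) might look like:

```python
def start_reprocess(website: dict) -> dict:
    """Reject a new crawl if one is already in flight (conflict protection),
    otherwise restart the lifecycle from the crawling stage."""
    if website["status"] in ("crawling", "embedding"):
        raise RuntimeError("Website is already processing; duplicate crawl prevented")
    # A real handler would clear website_content / pending_chunks here,
    # then hand the job to the smart crawler.
    return {**website, "status": "crawling"}
```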

Technical Architecture

Crawler Infrastructure

Service: FastAPI on Railway.com
Browser Engine: Playwright for JavaScript rendering
Authentication: Bearer token validation
Endpoint: POST /crawl with website_id in the request body
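Putting the pieces above together, a crawl request could be assembled as follows. The service URL and the website_id field come from this page; the exact header and body shapes beyond that are assumptions:

```python
import json

SMART_CRAWLER_URL = "https://smart-crawler-production.up.railway.app"

def build_crawl_request(website_id: str, service_token: str) -> dict:
    """Assemble a POST /crawl request with bearer-token authentication (sketch)."""
    return {
        "method": "POST",
        "url": f"{SMART_CRAWLER_URL}/crawl",
        "headers": {
            "Authorization": f"Bearer {service_token}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"website_id": website_id}),
    }
```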

Data Processing

Database: PostgreSQL with pgvector extension
Embeddings: Fireworks.ai nomic-embed-text-v1.5
Storage: website_content table with vector search
Real-time: Supabase channels for status updates
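For intuition, the ranking pgvector performs with `ORDER BY embedding <=> query LIMIT k` is cosine distance over stored chunk embeddings. A pure-Python sketch of that computation (the table/column names mirror the schema above; the math is standard):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def top_k(query: list[float], rows: list[tuple[str, list[float]]], k: int = 3):
    """Rank (chunk_text, embedding) rows by distance to the query embedding,
    as a vector search over website_content would."""
    return sorted(rows, key=lambda r: cosine_distance(query, r[1]))[:k]
```

In production this runs inside PostgreSQL with an index, so only the nearest chunks are fetched and passed to the chatbot as context.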