Crawl a million pages for $5. Scale forever.
Crawlspace is a centralized platform for developers to build and deploy web crawlers. Gather fresh data for your apps and agents while contributing to a platform-wide cache for crawler traffic.
Crawl at scale
Affordably crawl tens of millions of pages per month on horizontally scaling architecture.
Scrape with confidence
Use LLMs or CSS selectors to extract JSON that conforms to your custom schema.
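A minimal sketch of selector-based extraction, using the npm packages cheerio and zod; the selectors and the extract function are illustrative, not Crawlspace's actual API:

```ts
import * as cheerio from "cheerio";
import { z } from "zod";

// Your custom schema: extracted JSON must conform to it.
const Product = z.object({
  title: z.string(),
  price: z.string(),
});

// Hypothetical extractor; the selectors are placeholders for your target site.
export function extract(html: string): z.infer<typeof Product> {
  const $ = cheerio.load(html);
  return Product.parse({
    title: $("h1.product-title").text().trim(),
    price: $("span.price").text().trim(),
  });
}
```

`Product.parse` throws when a page doesn't yield data matching the schema, so malformed extractions fail loudly instead of polluting your dataset.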
Respect site owners
Follow robots.txt and respect rate-limiting responses by default. Pull content from a platform-wide TTL cache.
Storage included
Put structured data in SQLite, unstructured data in a bucket, and semantic data in a vector db.
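For illustration, a store step might fan a crawled page out to all three. The `ctx` bindings below are hypothetical stand-ins for whatever your crawler actually receives:

```ts
// Hypothetical context shape; these names are placeholders, not the real bindings.
type CrawlerContext = {
  db: { run(sql: string, ...params: unknown[]): Promise<void> };       // SQLite
  bucket: { put(key: string, body: string): Promise<void> };           // S3-compatible
  vectors: { upsert(id: string, embedding: number[]): Promise<void> }; // vector db
};

export async function store(
  ctx: CrawlerContext,
  url: string,
  html: string,
  embedding: number[],
) {
  await ctx.db.run("INSERT INTO pages (url) VALUES (?)", url);       // structured
  await ctx.bucket.put(`raw/${encodeURIComponent(url)}.html`, html); // unstructured
  await ctx.vectors.upsert(url, embedding);                          // semantic
}
```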
Resilient
Use LLMs to keep your crawlers working when websites change.
Performant
Let the platform handle scaling and concurrency for you.
Compliant
Respect robots.txt and rate-limiting responses out of the box.
Stateful
Every crawler gets its own queue, SQLite db, vector db, and S3-compatible bucket.
Serverless
Deploy web crawlers without maintaining your own infra.
TypeScript-first
Write type-safe code and import packages from the npm ecosystem.
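As a sketch, a handler can lean on npm and the compiler alike; the onPage signature below is illustrative, not the platform's real export shape:

```ts
import slugify from "slugify"; // any npm package can be imported

type Page = { url: string; html: string };

// Hypothetical handler: types flow end to end, so shape mistakes
// surface at compile time rather than mid-crawl.
export async function onPage(page: Page): Promise<{ slug: string }> {
  return { slug: slugify(new URL(page.url).pathname) };
}
```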
JavaScript capable
Render single-page applications that require JavaScript to run.
Observable
Ship traffic logs to an OpenTelemetry provider of your choice.
Edge cache
Reduce global traffic by pulling from a platform-wide TTL cache.
Scheduling
Run your crawlers on a recurring schedule.
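A hypothetical config sketch; the real scheduling syntax may differ, so check the docs:

```ts
// Standard five-field cron: minute, hour, day-of-month, month, day-of-week.
export const config = {
  schedule: "0 6 * * 1", // every Monday at 06:00 UTC
};
```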
Secrets
Crawl pages behind auth using your encrypted credentials.
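A sketch of a login step, assuming secrets arrive as env bindings; CRAWL_USER, CRAWL_PASS, and the login URL are placeholders:

```ts
type Secrets = { CRAWL_USER: string; CRAWL_PASS: string };

// Hypothetical: exchange encrypted credentials for a session cookie,
// then reuse the cookie on authenticated requests.
export async function login(env: Secrets): Promise<string | null> {
  const res = await fetch("https://example.com/login", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ user: env.CRAWL_USER, pass: env.CRAWL_PASS }),
  });
  return res.headers.get("set-cookie");
}
```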
Always-free egress
Accumulate massive datasets. Download at zero cost.