HTTP-based web page fetching and content extraction tool.

James Peret 02541fa47e Rename tool to kebab-case convention Changed ToolDefinition name from webFetchHttp to web-fetch-http for naming consistency with other tools. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>		2026-04-06 18:31:14 -03:00
src	Rename tool to kebab-case convention	2026-04-06 18:31:14 -03:00
tests	Refactor HTML extraction to Cheerio-based plain text pipeline	2026-03-15 08:21:41 -03:00
.gitignore	Initial commit: HTTP-based web fetch tool with AI content extraction	2025-09-12 19:21:23 -03:00
package-lock.json	Refactor HTML extraction to Cheerio-based plain text pipeline	2026-03-15 08:21:41 -03:00
package.json	Refactor HTML extraction to Cheerio-based plain text pipeline	2026-03-15 08:21:41 -03:00
README.md	Add context window fix with HTML preprocessing and semantic chunking	2025-09-19 04:02:59 -03:00
tsconfig.json	Initial commit: HTTP-based web fetch tool with AI content extraction	2025-09-12 19:21:23 -03:00
vitest.config.ts	Add context window fix with HTML preprocessing and semantic chunking	2025-09-19 04:02:59 -03:00

Web Fetch HTTP Tool

HTTP-based web page fetching and content extraction tool for the Fractal Synapse agent system.

Overview

The Web Fetch HTTP Tool is a lightweight alternative to browser-based web scraping. It fetches web pages directly over HTTP, without launching a browser, and uses an OpenAI model (GPT-4o-mini) to extract and structure the content.

Features

HTTP-based fetching - No browser dependencies, faster and more lightweight
AI-powered extraction - Uses OpenAI GPT-4o-mini to intelligently parse and extract content
Context window handling - Smart preprocessing and chunking for large pages (e.g., Wikipedia)
Multiple extraction modes - Text extraction, CSS selector targeting, or structured data
Robust error handling - Comprehensive error reporting with structured error objects
Same interface as Stagehand - Drop-in replacement for browser-based web-fetch tools on pages that do not require JavaScript rendering

Installation

npm install
npm run build

Usage

Parameters

url (required) - The URL to fetch content from
selector (optional) - CSS selector to extract specific content
extractText (optional, default: true) - Whether to extract text content

Extraction Modes

Text Mode (default, extractText: true)

await webFetchHttp('https://example.com')
// Returns: { title, content, summary, url, timestamp }

Selector Mode (when selector is provided)

await webFetchHttp('https://example.com', '.article-content')
// Returns: { title, content, url, timestamp }

Structured Mode (extractText: false)

await webFetchHttp('https://example.com', undefined, false)
// Returns: { title, content, links, images, url, timestamp }
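
The way the three parameters select a mode can be sketched as a small pure function. This is only an illustration: `WebFetchParams`, `ExtractionMode`, and `resolveMode` are hypothetical names, not taken from the tool's source, and the rule that a selector wins over `extractText: false` when both are given is an assumption.

```typescript
// Hypothetical types and names, used here for illustration only.
interface WebFetchParams {
  url: string;           // required
  selector?: string;     // optional CSS selector
  extractText?: boolean; // optional, defaults to true
}

type ExtractionMode = 'text' | 'selector' | 'structured';

// Assumption: a non-empty selector takes precedence over extractText.
function resolveMode(params: WebFetchParams): ExtractionMode {
  const extractText = params.extractText ?? true;
  if (params.selector !== undefined && params.selector.trim() !== '') {
    return 'selector';
  }
  return extractText ? 'text' : 'structured';
}

console.log(resolveMode({ url: 'https://example.com' })); // prints "text"
```

Keeping the decision in one function makes the precedence between `selector` and `extractText` explicit and easy to unit test.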

Example Results

Text Mode:

{
  "url": "https://example.com",
  "timestamp": "2025-09-12T10:30:00.000Z",
  "title": "Example Page",
  "content": "Main text content of the page...",
  "summary": "Brief summary of the page content"
}

Structured Mode:

{
  "url": "https://example.com",
  "timestamp": "2025-09-12T10:30:00.000Z",
  "title": "Example Page",
  "content": "Main text content...",
  "links": ["https://example.com/page1", "https://example.com/page2"],
  "images": ["https://example.com/image1.jpg", "https://example.com/image2.png"]
}

Error Handling

The tool returns structured error objects for all failure scenarios:

{
  "error": true,
  "message": "Human-readable error message",
  "details": "Technical details about the error",
  "timestamp": "2025-09-12T10:30:00.000Z",
  "toolName": "web-fetch-http",
  "url": "https://failed-url.com"
}

Common error scenarios:

Invalid URL format
Network connectivity issues
HTTP errors (404, 500, etc.)
AI content extraction failures
Malformed HTML content
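
A caller usually needs to tell an error object apart from a successful result. The sketch below shows one way to do that with a type guard, along with a hypothetical builder and URL check that produce objects in the shape shown above. The names `ToolError`, `buildToolError`, `isToolError`, and `validateUrl` are assumptions for illustration, and restricting URLs to http and https is also an assumption rather than documented behavior.

```typescript
// Shape taken from the error example above; all function names are hypothetical.
interface ToolError {
  error: true;
  message: string;
  details: string;
  timestamp: string;
  toolName: string;
  url: string;
}

function buildToolError(message: string, details: string, url: string): ToolError {
  return {
    error: true,
    message,
    details,
    timestamp: new Date().toISOString(),
    toolName: 'web-fetch-http',
    url,
  };
}

// Type guard: narrows an unknown result to ToolError when error === true.
function isToolError(result: unknown): result is ToolError {
  return (
    typeof result === 'object' &&
    result !== null &&
    (result as { error?: unknown }).error === true
  );
}

// Returns null for a usable URL, or a structured error otherwise.
// Assumption: only http: and https: URLs are accepted.
function validateUrl(url: string): ToolError | null {
  try {
    const parsed = new URL(url); // throws a TypeError on malformed input
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
      return buildToolError('Invalid URL format', `Unsupported protocol: ${parsed.protocol}`, url);
    }
    return null;
  } catch (err) {
    const details = err instanceof Error ? err.message : String(err);
    return buildToolError('Invalid URL format', details, url);
  }
}
```

With a guard like this, calling code can branch on `isToolError(result)` before reading `title` or `content`.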

Context Window Handling

The tool automatically handles large web pages that would exceed AI model context windows:

HTML Preprocessing - Removes unnecessary tags (scripts, styles, navigation, ads)
CSS Selector Early Application - Reduces content size before AI processing
Semantic Chunking - Splits large content at natural boundaries (sections, articles)
Token Estimation - Monitors content size and applies chunking when needed (>15,000 tokens)
Result Combination - Merges chunked results while preserving structure
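
The token estimation and chunking steps above can be sketched roughly as follows. This is a simplified illustration, not the tool's actual code: it assumes about 4 characters per token (a common rule of thumb for English text) and splits plain text at blank lines instead of at HTML section or article boundaries. `estimateTokens` and `chunkText` are hypothetical names.

```typescript
// Assumption: roughly 4 characters per token.
const CHARS_PER_TOKEN = 4;
// Threshold quoted in the list above.
const MAX_TOKENS_PER_CHUNK = 15_000;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Splits text at blank lines and greedily packs paragraphs into chunks
// that stay under the token budget. A single paragraph larger than the
// budget is kept whole as its own oversized chunk.
function chunkText(text: string, maxTokens: number = MAX_TOKENS_PER_CHUNK): string[] {
  if (estimateTokens(text) <= maxTokens) {
    return [text]; // small enough, no chunking needed
  }

  const paragraphs = text.split(/\n{2,}/).filter((p) => p.trim() !== '');
  const chunks: string[] = [];
  let current = '';

  for (const paragraph of paragraphs) {
    const candidate = current === '' ? paragraph : `${current}\n\n${paragraph}`;
    if (current !== '' && estimateTokens(candidate) > maxTokens) {
      chunks.push(current); // close the current chunk and start a new one
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current !== '') {
    chunks.push(current);
  }
  return chunks;
}
```

Each chunk would then be sent to the model separately and the partial results merged afterwards.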

Example with Wikipedia Moon page (>240,000 tokens):

// Automatically chunks and processes without context window errors
const result = await webFetchHttp('https://en.wikipedia.org/wiki/Moon')
// Returns combined content from all chunks
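
How the per-chunk results are merged is not spelled out here. One plausible approach, shown only as a sketch, is to take the first non-empty title, join the content in order, and de-duplicate links and images. `ChunkResult`, `CombinedResult`, and `combineChunkResults` are hypothetical names, and the merge rules are assumptions.

```typescript
// Hypothetical shapes for a per-chunk result and the merged result.
interface ChunkResult {
  title?: string;
  content: string;
  links?: string[];
  images?: string[];
}

interface CombinedResult {
  title: string;
  content: string;
  links: string[];
  images: string[];
}

// Removes duplicates while preserving first-seen order.
function unique(values: string[]): string[] {
  return values.filter((value, index) => values.indexOf(value) === index);
}

function combineChunkResults(results: ChunkResult[]): CombinedResult {
  // Assumption: the first chunk that reports a title provides the page title.
  let title = '';
  for (const result of results) {
    if (result.title) {
      title = result.title;
      break;
    }
  }

  // Join non-empty chunk contents in their original order.
  const content = results
    .map((r) => r.content.trim())
    .filter((c) => c !== '')
    .join('\n\n');

  const links = unique(results.reduce<string[]>((all, r) => all.concat(r.links ?? []), []));
  const images = unique(results.reduce<string[]>((all, r) => all.concat(r.images ?? []), []));

  return { title, content, links, images };
}
```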

Testing

Unit Tests

npm run test:unit          # Run unit tests only
npm run test:run           # Run all tests including integration
npm run test:watch         # Watch mode for development

Integration Tests

npm run test:integration              # Run integration tests (skips without API key)
npm run test:integration:expensive    # Run expensive tests (requires API key)

The tests cover:

HTML preprocessing and chunking logic
Context window handling
CSS selector extraction
Error handling scenarios
Wikipedia Moon page integration test
Agent-core integration patterns

Requirements

Node.js environment
OPENAI_API_KEY environment variable for AI content extraction
Internet connectivity for fetching web pages

Integration

To integrate with Fractal Synapse agent applications:

Add to package.json dependencies:

{
  "dependencies": {
    "web-fetch-http-tool": "file:../../packages/tools/web-fetch-http-tool"
  }
}

Import and register:

import { webFetchHttpToolDefinition } from 'web-fetch-http-tool';

// Register under the tool's kebab-case name
toolRegistry.registerTool('web-fetch-http', webFetchHttpToolDefinition);

Add to AgentDefinition:

const agentDefinition = new AgentDefinition(
  'My Agent',
  'Description',
  'System prompt',
  ['web-fetch-http'], // Must match the registered tool name
  'openai-gpt-4o'
);

Comparison with Stagehand Tools

Feature	Web Fetch HTTP	Stagehand Web Fetch
Speed	⚡ Fast	🐌 Slower
Resources	💡 Lightweight	🔋 Heavy (browser)
JavaScript Support	❌ No	✅ Yes
Complex Interactions	❌ No	✅ Yes
Setup Complexity	✅ Simple	❌ Complex
Reliability	✅ High	⚠️ Browser dependencies

License

ISC

Author

James Peret