HTTP-based web page fetching and content extraction tool.
James Peret 02541fa47e
Rename tool to kebab-case convention
Changed ToolDefinition name from webFetchHttp to web-fetch-http for
naming consistency with other tools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 18:31:14 -03:00
src Rename tool to kebab-case convention 2026-04-06 18:31:14 -03:00
tests Refactor HTML extraction to Cheerio-based plain text pipeline 2026-03-15 08:21:41 -03:00
.gitignore Initial commit: HTTP-based web fetch tool with AI content extraction 2025-09-12 19:21:23 -03:00
package-lock.json Refactor HTML extraction to Cheerio-based plain text pipeline 2026-03-15 08:21:41 -03:00
package.json Refactor HTML extraction to Cheerio-based plain text pipeline 2026-03-15 08:21:41 -03:00
README.md Add context window fix with HTML preprocessing and semantic chunking 2025-09-19 04:02:59 -03:00
tsconfig.json Initial commit: HTTP-based web fetch tool with AI content extraction 2025-09-12 19:21:23 -03:00
vitest.config.ts Add context window fix with HTML preprocessing and semantic chunking 2025-09-19 04:02:59 -03:00

Web Fetch HTTP Tool

HTTP-based web page fetching and content extraction tool for the Fractal Synapse agent system.

Overview

The Web Fetch HTTP Tool is a lightweight alternative to browser-based web scraping. It fetches web pages directly over HTTP and uses an OpenAI model (GPT-4o-mini) to extract and structure the content.

Features

  • HTTP-based fetching - No browser dependencies, faster and more lightweight
  • AI-powered extraction - Uses OpenAI GPT-4o-mini to intelligently parse and extract content
  • Context window handling - Smart preprocessing and chunking for large pages (e.g., Wikipedia)
  • Multiple extraction modes - Text extraction, CSS selector targeting, or structured data
  • Robust error handling - Comprehensive error reporting with structured error objects
  • Same interface as Stagehand - Drop-in replacement for browser-based web-fetch tools on pages that do not need JavaScript rendering or complex interactions

Installation

npm install
npm run build

Usage

Parameters

  • url (required) - The URL to fetch content from
  • selector (optional) - CSS selector to extract specific content
  • extractText (optional, default: true) - Whether to extract text content

Extraction Modes

  1. Text Mode (default, extractText: true)

    await webFetchHttp('https://example.com')
    // Returns: { title, content, summary, url, timestamp }
    
  2. Selector Mode (when selector is provided)

    await webFetchHttp('https://example.com', '.article-content')
    // Returns: { title, content, url, timestamp }
    
  3. Structured Mode (extractText: false)

    await webFetchHttp('https://example.com', undefined, false)
    // Returns: { title, content, links, images, url, timestamp }
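The three modes above follow from the two optional parameters. As a minimal sketch of that decision, assuming a provided selector takes precedence over extractText (the README does not state the precedence), the mapping could look like this. The names ExtractionMode, FetchParams, and resolveMode are hypothetical and are not part of the tool's API.

```typescript
// Hypothetical types for illustration; the tool's real types may differ.
type ExtractionMode = "text" | "selector" | "structured";

interface FetchParams {
  url: string;
  selector?: string;
  extractText?: boolean; // documented default: true
}

// Assumption: a non-empty selector wins over extractText.
function resolveMode(params: FetchParams): ExtractionMode {
  if (params.selector !== undefined && params.selector.trim() !== "") {
    return "selector";
  }
  // Apply the documented default when extractText is omitted.
  const extractText = params.extractText ?? true;
  return extractText ? "text" : "structured";
}

console.log(resolveMode({ url: "https://example.com" })); // "text"
console.log(resolveMode({ url: "https://example.com", selector: ".article-content" })); // "selector"
console.log(resolveMode({ url: "https://example.com", extractText: false })); // "structured"
```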
    

Example Results

Text Mode:

{
  "url": "https://example.com",
  "timestamp": "2025-09-12T10:30:00.000Z",
  "title": "Example Page",
  "content": "Main text content of the page...",
  "summary": "Brief summary of the page content"
}

Structured Mode:

{
  "url": "https://example.com",
  "timestamp": "2025-09-12T10:30:00.000Z",
  "title": "Example Page",
  "content": "Main text content...",
  "links": ["https://example.com/page1", "https://example.com/page2"],
  "images": ["https://example.com/image1.jpg", "https://example.com/image2.png"]
}

Error Handling

The tool returns structured error objects for all failure scenarios:

{
  "error": true,
  "message": "Human-readable error message",
  "details": "Technical details about the error",
  "timestamp": "2025-09-12T10:30:00.000Z",
  "toolName": "web-fetch-http",
  "url": "https://failed-url.com"
}

Common error scenarios:

  • Invalid URL format
  • Network connectivity issues
  • HTTP errors (404, 500, etc.)
  • AI content extraction failures
  • Malformed HTML content
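Because failures come back as objects rather than thrown exceptions, callers need to tell an error object apart from a successful result. The sketch below shows one way to do that with a type guard built from the fields in the error example above. WebFetchError, isWebFetchError, and describeResult are hypothetical names; the package may export its own types.

```typescript
// Shape taken from the error example in this README; the interface name is an assumption.
interface WebFetchError {
  error: true;
  message: string;
  details: string;
  timestamp: string;
  toolName: string;
  url: string;
}

const ERROR_STRING_FIELDS = ["message", "details", "timestamp", "toolName", "url"] as const;

// Narrow an unknown result to the structured error shape.
function isWebFetchError(value: unknown): value is WebFetchError {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    record.error === true &&
    ERROR_STRING_FIELDS.every((field) => typeof record[field] === "string")
  );
}

// Example caller: report failures, pass successes through.
function describeResult(result: unknown): string {
  if (isWebFetchError(result)) {
    return `Fetch failed for ${result.url}: ${result.message}`;
  }
  return "ok";
}
```

Checking `error === true` first keeps the guard cheap for successful results, which carry no `error` field in the examples above.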

Context Window Handling

The tool automatically handles large web pages that would exceed AI model context windows:

  • HTML Preprocessing - Removes unnecessary tags (scripts, styles, navigation, ads)
  • CSS Selector Early Application - Reduces content size before AI processing
  • Semantic Chunking - Splits large content at natural boundaries (sections, articles)
  • Token Estimation - Monitors content size and applies chunking when needed (>15,000 tokens)
  • Result Combination - Merges chunked results while preserving structure
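To make the token estimation and chunking steps concrete, here is a minimal sketch. It is not the tool's implementation: it assumes a rough heuristic of four characters per token and splits plain text at blank-line paragraph boundaries, whereas the tool is described as splitting at sections and articles. Only the 15,000-token threshold comes from this README; estimateTokens and chunkByParagraph are hypothetical names.

```typescript
const TOKEN_THRESHOLD = 15000; // threshold stated in this README
const CHARS_PER_TOKEN = 4; // assumption: rough heuristic, not a real tokenizer

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Greedily pack paragraphs into chunks that stay under maxTokens.
// A single paragraph larger than maxTokens is kept whole in its own chunk.
function chunkByParagraph(text: string, maxTokens: number = TOKEN_THRESHOLD): string[] {
  if (estimateTokens(text) <= maxTokens) return [text];

  const paragraphs = text.split(/\n{2,}/).filter((p) => p.trim() !== "");
  const chunks: string[] = [];
  let current = "";

  for (const paragraph of paragraphs) {
    const candidate = current === "" ? paragraph : `${current}\n\n${paragraph}`;
    if (current !== "" && estimateTokens(candidate) > maxTokens) {
      chunks.push(current); // close the chunk before it overflows
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

console.log(estimateTokens("a".repeat(100))); // 25
```

Splitting only at paragraph boundaries keeps each chunk readable on its own, at the cost of uneven chunk sizes.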

Example with Wikipedia Moon page (>240,000 tokens):

// Automatically chunks and processes without context window errors
const result = await webFetchHttp('https://en.wikipedia.org/wiki/Moon')
// Returns combined content from all chunks
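The result combination step could look roughly like the sketch below. It assumes each chunk yields a partial result with the same fields as the structured-mode output, keeps the first non-empty title, joins content in order, and de-duplicates links and images. ChunkResult and combineChunkResults are hypothetical names, and the tool's actual merge rules may differ.

```typescript
// Hypothetical per-chunk result; field names mirror the structured-mode example.
interface ChunkResult {
  title: string;
  content: string;
  links?: string[];
  images?: string[];
}

// Remove duplicates while keeping first-seen order.
function uniqueInOrder(values: string[]): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const value of values) {
    if (!seen.has(value)) {
      seen.add(value);
      out.push(value);
    }
  }
  return out;
}

function combineChunkResults(results: ChunkResult[]): ChunkResult {
  if (results.length === 0) {
    throw new Error("No chunk results to combine");
  }
  let title = "";
  const contents: string[] = [];
  const links: string[] = [];
  const images: string[] = [];

  for (const result of results) {
    if (title === "" && result.title.trim() !== "") title = result.title;
    if (result.content.trim() !== "") contents.push(result.content.trim());
    links.push(...(result.links ?? []));
    images.push(...(result.images ?? []));
  }

  return {
    title,
    content: contents.join("\n\n"),
    links: uniqueInOrder(links),
    images: uniqueInOrder(images),
  };
}
```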

Testing

Unit Tests

npm run test:unit          # Run unit tests only
npm run test:run           # Run all tests including integration
npm run test:watch         # Watch mode for development

Integration Tests

npm run test:integration              # Run integration tests (skips without API key)
npm run test:integration:expensive    # Run expensive tests (requires API key)

The tests cover:

  • HTML preprocessing and chunking logic
  • Context window handling
  • CSS selector extraction
  • Error handling scenarios
  • Wikipedia Moon page integration test
  • Agent-core integration patterns

Requirements

  • Node.js environment
  • OPENAI_API_KEY environment variable for AI content extraction
  • Internet connectivity for fetching web pages

Integration

To integrate with Fractal Synapse agent applications:

  1. Add to package.json dependencies:

    {
      "dependencies": {
        "web-fetch-http-tool": "file:../../packages/tools/web-fetch-http-tool"
      }
    }
    
  2. Import and register:

    import { webFetchHttpToolDefinition } from 'web-fetch-http-tool';
    
    toolRegistry.registerTool('web-fetch-http', webFetchHttpToolDefinition);
    
  3. Add to AgentDefinition:

    const agentDefinition = new AgentDefinition(
      'My Agent',
      'Description',
      'System prompt',
      ['web-fetch-http'], // Tool name, matching the ToolDefinition name
      'openai-gpt-4o'
    );
    

Comparison with Stagehand Tools

| Feature | Web Fetch HTTP | Stagehand Web Fetch |
| --- | --- | --- |
| Speed | Fast | 🐌 Slower |
| Resources | 💡 Lightweight | 🔋 Heavy (browser) |
| JavaScript Support | No | Yes |
| Complex Interactions | No | Yes |
| Setup Complexity | Simple | Complex |
| Reliability | High | ⚠️ Browser dependencies |

License

ISC

Author

James Peret