NotexAI

Scrape Data

This endpoint scrapes data from a URL inside a browser session. If no URL is given, the session's current page is scraped; if no session ID is given, a new session is created.


Endpoint

POST /env/scrape


Authorizations

  • Authorization (required):

    • Type: string

    • Location: Header

    • Description: The access token received from the authorization server in the OAuth 2.0 flow, sent as "Bearer <token>".


Body

Content Type: application/json

  • keep_alive:

    • Type: boolean

    • Default: false

    • Description: If true, the session will not be closed after the operation is completed.

  • max_nb_actions:

    • Type: integer

    • Default: 100

    • Description: The maximum number of actions to list; listing stops once this number is reached. Used when min_nb_actions is not provided.

  • min_nb_actions:

    • Type: integer | null

    • Description: The minimum number of actions to list before stopping. If not provided, the listing will continue until max_nb_actions is reached.

  • only_main_content:

    • Type: boolean

    • Default: true

    • Description: Whether to scrape only the main content of the page. If true, navbars, footers, and similar elements are excluded.

  • scrape_images:

    • Type: boolean

    • Default: false

    • Description: Whether to scrape images from the page. Images are not scraped by default.

  • screenshot:

    • Type: boolean | null

    • Description: Whether to include a screenshot in the response.

  • session_id:

    • Type: string | null

    • Description: The ID of the session. A new session is created if not provided.

  • session_timeout_minutes:

    • Type: integer

    • Default: 5

    • Description: Session timeout in minutes. Cannot exceed the global timeout.

    • Range: 0 < x ≤ 30

  • url:

    • Type: string | null

    • Description: The URL to scrape. If not provided, the current page of the session is scraped.
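
The body fields above can be assembled and checked client-side before sending. The following is a minimal sketch, not part of any official SDK: the helper name build_scrape_body is hypothetical, and it assumes that omitting a nullable field is equivalent to sending null, which this page does not confirm.

```python
from typing import Any, Dict, Optional

# Upper bound taken from the documented range 0 < x <= 30.
MAX_SESSION_TIMEOUT_MINUTES = 30


def build_scrape_body(
    url: Optional[str] = None,
    session_id: Optional[str] = None,
    keep_alive: bool = False,
    only_main_content: bool = True,
    scrape_images: bool = False,
    screenshot: Optional[bool] = None,
    session_timeout_minutes: int = 5,
) -> Dict[str, Any]:
    """Hypothetical helper: build the JSON body for POST /env/scrape."""
    if not 0 < session_timeout_minutes <= MAX_SESSION_TIMEOUT_MINUTES:
        raise ValueError("session_timeout_minutes must satisfy 0 < x <= 30")

    body: Dict[str, Any] = {
        "url": url,
        "session_id": session_id,
        "keep_alive": keep_alive,
        "only_main_content": only_main_content,
        "scrape_images": scrape_images,
        "screenshot": screenshot,
        "session_timeout_minutes": session_timeout_minutes,
    }
    # Assumption: leaving out a nullable field behaves like sending null.
    return {key: value for key, value in body.items() if value is not None}
```

Calling build_scrape_body(url="https://example.com", scrape_images=True) yields a body with the documented defaults filled in and no session_id key, so the service would be expected to create a new session.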


Response

Response Parameters

  • metadata (required):

    • Type: object

    • Description: Metadata of the current page, including URL, title, and snapshot timestamp.

    • Attributes:

      • metadata.title (required): string - The title of the page.

      • metadata.url (required): string - The URL of the page.

      • metadata.timestamp (required): string - The timestamp when the scrape was performed.

  • session (required):

    • Type: object

    • Description: Browser session information.

    • Attributes:

      • session.created_at (required): string - Session creation time.

      • session.duration (required): string - Session duration.

      • session.last_accessed_at (required): string - Last access time.

      • session.session_id (required): string - The ID of the session.

      • session.status (required): enum<string> - Session status. Options: active, closed, error, timed_out.

      • session.timeout_minutes (required): integer - Session timeout in minutes.

      • session.error (optional): string | null - Error message if the operation failed to complete.

  • data (optional):

    • Type: object

    • Description: Extracted data from the page.

    • Attributes:

      • data.images (optional): object[] - List of images extracted from the page (ID and download link).

      • data.markdown (optional): string | null - Markdown representation of the extracted data.

      • data.structured (optional): object[] | null - Structured data extracted from the page in JSON format.

  • screenshot (optional):

    • Type: file | null

    • Description: Base64-encoded screenshot of the current page.

  • space (optional):

    • Type: object

    • Description: Available actions in the current state.

    • Attributes:

      • space.actions (required): object[] - List of available actions in the current state.

      • space.description (required): string - Human-readable description of the current webpage.

      • space.special_actions (optional): object[] - List of special browser actions.
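
Since only metadata and session are marked required, client code has to tolerate a missing data object. The sketch below shows one way to check a decoded response; summarize_scrape_response and ScrapeError are hypothetical names, and treating the "error" and "timed_out" statuses as failures (while accepting "closed", which may follow a request with keep_alive set to false) is an assumption rather than documented behavior.

```python
from typing import Any, Dict, Optional


class ScrapeError(RuntimeError):
    """Raised when a scrape response is malformed or reports a failure."""


def summarize_scrape_response(resp: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Hypothetical helper: validate a decoded response and flatten key fields."""
    # metadata and session are documented as required.
    for key in ("metadata", "session"):
        if key not in resp:
            raise ScrapeError(f"missing required field: {key}")

    session = resp["session"]
    # Assumption: an error message or a failed status means the scrape failed.
    if session.get("error") or session["status"] in ("error", "timed_out"):
        reason = session.get("error") or session["status"]
        raise ScrapeError(f"session {session['session_id']} failed: {reason}")

    # data is optional, and data.markdown may be null.
    data = resp.get("data") or {}
    return {
        "title": resp["metadata"]["title"],
        "url": resp["metadata"]["url"],
        "session_id": session["session_id"],
        "markdown": data.get("markdown"),
    }
```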


Example Request

curl --location \
--request POST 'https://api.notexai.pro/env/scrape' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "session_id": "abcd1234-5678-90ef-ghij-klmnopqrstuv",
    "url": "https://example.com",
    "scrape_images": true,
    "only_main_content": true,
    "screenshot": true
}'
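
For readers who prefer Python over curl, the same request can be sketched with the standard library alone. This is not the Python SDK: the function names are hypothetical, the URL and headers are copied from the curl example above, and the 60-second timeout is an arbitrary choice.

```python
import json
import urllib.request
from typing import Any, Dict

# Endpoint URL as it appears in the curl example.
API_URL = "https://api.notexai.pro/env/scrape"


def build_scrape_request(token: str, body: Dict[str, Any]) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) the POST request."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def scrape(token: str, body: Dict[str, Any], timeout: float = 60.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response (performs network I/O)."""
    request = build_scrape_request(token, body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the headers and body easy to inspect or log before any network call is made.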

Example Response

200 - application/json

{
    "metadata": {
        "title": "Example Page Title",
        "url": "https://example.com",
        "timestamp": "2025-01-24T16:00:00Z"
    },
    "session": {
        "created_at": "2025-01-24T15:00:00Z",
        "duration": "10 minutes",
        "last_accessed_at": "2025-01-24T15:50:00Z",
        "session_id": "abcd1234-5678-90ef-ghij-klmnopqrstuv",
        "status": "active",
        "timeout_minutes": 10,
        "error": null
    },
    "data": {
        "images": [
            {
                "id": "image1",
                "url": "https://example.com/image1.jpg"
            }
        ],
        "markdown": "# Example Page\nContent goes here.",
        "structured": null
    },
    "screenshot": "...base64-encoded-data...",
    "space": {
        "description": "This page allows users to perform various actions.",
        "actions": [
            {
                "id": "action1",
                "description": "Search for items."
            }
        ]
    }
}
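
A response shaped like the example above can be unpacked as follows. This is a sketch under two assumptions: each image object carries a "url" key, as in the example, and the screenshot field is a plain Base64 string with no data-URI prefix. Both helper names are hypothetical.

```python
import base64
from typing import Any, Dict, List, Optional


def image_urls(resp: Dict[str, Any]) -> List[str]:
    """Collect download links from data.images, skipping entries without one."""
    data = resp.get("data") or {}
    return [img["url"] for img in (data.get("images") or []) if "url" in img]


def screenshot_bytes(resp: Dict[str, Any]) -> Optional[bytes]:
    """Decode the screenshot, or return None when it is absent or null."""
    encoded = resp.get("screenshot")
    if not encoded:
        return None
    # Assumption: the value is plain Base64 text.
    return base64.b64decode(encoded)
```

The decoded bytes can then be written to disk with open(path, "wb"); the image format is not stated on this page, so the file extension is left to the caller.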

Last updated 4 months ago