Scrape Data

This endpoint scrapes data from a specified URL within the session’s environment.

Endpoint

POST /env/scrape

Authorizations

Authorization (required):
- Type: string
- Location: Header
- Description: The access token received from the authorization server in the OAuth 2.0 flow.

Body

Content Type: application/json

keep_alive:
- Type: boolean
- Default: false
- Description: If true, the session will not be closed after the operation is completed.
max_nb_actions:
- Type: integer
- Default: 100
- Description: The maximum number of actions to list; listing stops once this number is reached. Used when min_nb_actions is not provided.
min_nb_actions:
- Type: integer | null
- Description: The minimum number of actions to list before stopping. If not provided, the listing will continue until max_nb_actions is reached.
only_main_content:
- Type: boolean
- Default: true
- Description: Whether to only scrape the main content of the page. If true, navbars, footers, etc., are excluded.
scrape_images:
- Type: boolean
- Default: false
- Description: Whether to scrape images from the page. Images are not scraped by default.
screenshot:
- Type: boolean | null
- Description: Whether to include a screenshot in the response.
session_id:
- Type: string | null
- Description: The ID of the session. A new session is created if not provided.
session_timeout_minutes:
- Type: integer
- Default: 5
- Description: Session timeout in minutes. Cannot exceed the global timeout.
- Range: 0 < x ≤ 30
url:
- Type: string | null
- Description: The URL to scrape. If not provided, the session's current page URL is used.
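
The body parameters above can be assembled and checked on the client before a request is sent. The following is a minimal Python sketch, not part of the API: build_scrape_body is a hypothetical helper name, and omitting nullable fields instead of sending null is an assumption about what the server accepts. It applies the documented defaults and enforces the documented range for session_timeout_minutes.

```python
from typing import Any, Dict, Optional


def build_scrape_body(
    url: Optional[str] = None,
    session_id: Optional[str] = None,
    keep_alive: bool = False,
    only_main_content: bool = True,
    scrape_images: bool = False,
    screenshot: Optional[bool] = None,
    session_timeout_minutes: int = 5,
) -> Dict[str, Any]:
    """Hypothetical helper: build the JSON body for POST /env/scrape."""
    # Documented range for session_timeout_minutes: 0 < x <= 30.
    if not 0 < session_timeout_minutes <= 30:
        raise ValueError("session_timeout_minutes must satisfy 0 < x <= 30")

    # Fields with documented defaults are always included.
    body: Dict[str, Any] = {
        "keep_alive": keep_alive,
        "only_main_content": only_main_content,
        "scrape_images": scrape_images,
        "session_timeout_minutes": session_timeout_minutes,
    }

    # Assumption: nullable fields can simply be left out when not provided.
    nullable = (("url", url), ("session_id", session_id), ("screenshot", screenshot))
    for key, value in nullable:
        if value is not None:
            body[key] = value
    return body
```

Calling build_scrape_body(url="https://example.com", scrape_images=True) returns a dictionary with no session_id key, which according to the description above results in a new session being created.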

Response

Response Parameters

metadata (required):
- Type: object
- Description: Metadata of the current page, including URL, title, and snapshot timestamp.
- Attributes:
  - metadata.title (required): string - The title of the page.
  - metadata.url (required): string - The URL of the page.
  - metadata.timestamp (required): string - The timestamp when the scrape was performed.
session (required):
- Type: object
- Description: Browser session information.
- Attributes:
  - session.created_at (required): string - Session creation time.
  - session.duration (required): string - Session duration.
  - session.last_accessed_at (required): string - Last access time.
  - session.session_id (required): string - The ID of the session.
  - session.status (required): enum<string> - Session status. Options: active, closed, error, timed_out.
  - session.timeout_minutes (required): integer - Session timeout in minutes.
  - session.error (optional): string | null - Error message if the operation failed to complete.
data (optional):
- Type: object
- Description: Extracted data from the page.
- Attributes:
  - data.images (optional): object[] - List of images extracted from the page (ID and download link).
  - data.markdown (optional): string | null - Markdown representation of the extracted data.
  - data.structured (optional): object[] | null - Structured data extracted from the page in JSON format.
screenshot (optional):
- Type: file | null
- Description: Base64-encoded screenshot of the current page.
space (optional):
- Type: object
- Description: Available actions in the current state.
- Attributes:
  - space.actions (required): object[] - List of available actions in the current state.
  - space.description (required): string - Human-readable description of the current webpage.
  - space.special_actions (optional): object[] - List of special browser actions.
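
The response fields listed above can be read defensively, since only metadata and session are marked as required. The sketch below is one possible way to do that in Python; summarize_scrape_response is a hypothetical helper, and treating the screenshot as a standard Base64 string is an assumption based on its description.

```python
import base64
from typing import Any, Dict, Optional


def summarize_scrape_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: extract commonly used fields from a scrape response."""
    metadata = response["metadata"]    # required
    session = response["session"]      # required
    data = response.get("data") or {}  # optional, may be missing

    # Assumption: the screenshot, when present, is a standard Base64 string.
    screenshot_b64: Optional[str] = response.get("screenshot")
    screenshot_bytes = base64.b64decode(screenshot_b64) if screenshot_b64 else None

    return {
        "url": metadata["url"],
        "title": metadata["title"],
        "session_id": session["session_id"],
        "status": session["status"],
        "error": session.get("error"),  # null unless the operation failed
        "markdown": data.get("markdown"),
        "image_count": len(data.get("images") or []),
        "screenshot_bytes": screenshot_bytes,
    }
```

A session status of closed is not treated as a failure here, because a session is closed after the operation unless keep_alive is true; session.error is the field checked for failures.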

Example Request

curl --location \
--request POST 'https://api.notexai.pro/env/scrape' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "session_id": "abcd1234-5678-90ef-ghij-klmnopqrstuv",
    "url": "https://example.com",
    "scrape_images": true,
    "only_main_content": true,
    "screenshot": true
}'
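
The same request can be prepared from Python using only the standard library. This is a sketch under the assumption that the endpoint behaves as in the curl example above; make_scrape_request is a hypothetical helper name, and the function only builds the request object without sending it.

```python
import json
import urllib.request

# Endpoint URL taken from the curl example above.
API_URL = "https://api.notexai.pro/env/scrape"


def make_scrape_request(api_key: str, body: dict) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) the POST /env/scrape request."""
    payload = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


# To send the request and decode the JSON response:
#     request = make_scrape_request("your-api-key", {"url": "https://example.com"})
#     with urllib.request.urlopen(request) as resp:
#         result = json.load(resp)
```

Keeping request construction separate from sending makes it possible to inspect the headers and body before any network call is made.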

Example Response

200 - application/json

{
    "metadata": {
        "title": "Example Page Title",
        "url": "https://example.com",
        "timestamp": "2025-01-24T16:00:00Z"
    },
    "session": {
        "created_at": "2025-01-24T15:00:00Z",
        "duration": "10 minutes",
        "last_accessed_at": "2025-01-24T15:50:00Z",
        "session_id": "abcd1234-5678-90ef-ghij-klmnopqrstuv",
        "status": "active",
        "timeout_minutes": 10,
        "error": null
    },
    "data": {
        "images": [
            {
                "id": "image1",
                "url": "https://example.com/image1.jpg"
            }
        ],
        "markdown": "# Example Page\nContent goes here.",
        "structured": null
    },
    "screenshot": "...base64-encoded-data...",
    "space": {
        "description": "This page allows users to perform various actions.",
        "actions": [
            {
                "id": "action1",
                "description": "Search for items."
            }
        ]
    }
}

