crawl

Web crawling and documentation ingestion. The crawl command group registers documentation sources, ingests them into the content store, and links the crawled chunks into the knowledge graph so that they surface in sibyl search.

Commands

| Command | Description |
| --- | --- |
| sibyl crawl list | List crawl sources |
| sibyl crawl add | Add a new documentation source |
| sibyl crawl ingest | Start crawling a source |
| sibyl crawl status | Get crawl status for a source |
| sibyl crawl show | Show crawl source details |
| sibyl crawl stats | Show crawling statistics |
| sibyl crawl health | Check crawl system health |
| sibyl crawl delete | Delete a source and all its documents |
| sibyl crawl link-status | Show pending graph linking work per source |
| sibyl crawl link-graph | Link crawled chunks into the graph |
| sibyl crawl documents | Browse crawled documents |

Workflow

add  ->  ingest  ->  link-graph  ->  search
 |         |           |
 source    documents   graph entities

Register a source with add, crawl it with ingest, then run link-graph so that the chunks become graph entities. link-status shows what still needs linking.
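
The ingest and linking steps of this workflow can be sketched as a small shell function. This is an illustration only: crawl_and_link is a hypothetical helper rather than part of sibyl, abc123 is a placeholder source ID, and the sketch assumes that sibyl exits with a non-zero status on failure, which this page does not state.

```shell
# Hypothetical wrapper (not part of sibyl): runs the ingest and link steps
# for a source that has already been registered with `sibyl crawl add`.
# Assumes sibyl returns a non-zero exit status when a step fails.
crawl_and_link() {
  # $1: source ID (placeholder, e.g. abc123)
  sibyl crawl ingest "$1" --max-pages 100 || return 1   # crawl up to 100 pages
  sibyl crawl link-graph "$1" --dry-run || return 1     # preview the linking work
  sibyl crawl link-graph "$1" || return 1               # link chunks into the graph
  sibyl crawl link-status                               # show what is still pending
}
```

Calling crawl_and_link abc123 would run the four commands in order and stop at the first one that fails.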


crawl list

List crawl sources.

bash
sibyl crawl list [options]

| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --status | -s | (all) | Filter by status |
| --limit | -n | 20 | Max results |
| --json | -j | false | JSON output |

crawl add

Add a new documentation source.

bash
sibyl crawl add <url> [options]
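
| Argument | Required | Description |
| --- | --- | --- |
| url | Yes | Documentation URL to add |

| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --name | -n | (derived) | Source name |
| --type | -T | website | Source type: website, github, api_docs |
| --depth | -d | 2 | Crawl depth |
| --pattern / --include | -p | (none) | URL patterns to include |
| --json | -j | false | JSON output |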

Example

bash
sibyl crawl add https://docs.example.com \
  --name "Example Docs" --type website --depth 3 \
  --pattern "/guide/*"

crawl ingest

Start crawling a documentation source.

bash
sibyl crawl ingest <source_id> [options]
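
| Argument | Required | Description |
| --- | --- | --- |
| source_id | Yes | Source ID to crawl |

| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --max-pages | -p | 50 | Maximum pages to crawl |
| --depth | -d | 3 | Maximum link depth |
| --no-embed | | false | Skip embedding generation |
| --json | -j | false | JSON output |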

Examples

bash
sibyl crawl ingest abc123 --max-pages 100
sibyl crawl ingest abc123 --depth 2 --no-embed

crawl status

Get the status of a crawl source using the current source-status contract.

bash
sibyl crawl status <source_id> [options]

| Option | Short | Description |
| --- | --- | --- |
| --json | -j | JSON output |

crawl show

Show crawl source details.

bash
sibyl crawl show <source_id> [options]

| Option | Short | Description |
| --- | --- | --- |
| --json | -j | JSON output |

crawl stats

Show crawling statistics across all sources.

bash
sibyl crawl stats [--json]

crawl health

Check crawl system health.

bash
sibyl crawl health [--json]

crawl delete

Delete a crawl source and all its documents.

bash
sibyl crawl delete <source_id> [options]

| Option | Short | Description |
| --- | --- | --- |
| --json | -j | JSON output |
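
Because delete also removes every document belonging to the source, it can help to inspect the source first. The following is a minimal sketch, assuming a hypothetical confirm_delete helper that is not part of sibyl; this page does not say whether sibyl crawl delete asks for confirmation itself, so the prompt here is purely illustrative.

```shell
# Hypothetical helper (not part of sibyl): show a source, then delete it
# only after an explicit "y" or "Y" is read from standard input.
confirm_delete() {
  # $1: source ID (placeholder, e.g. abc123)
  sibyl crawl show "$1" || return 1   # print the source details first
  printf 'Delete source %s and all its documents? [y/N] ' "$1"
  read -r answer
  case "$answer" in
    y|Y) sibyl crawl delete "$1" ;;
    *)   echo "Aborted."; return 1 ;;
  esac
}
```

Any answer other than y or Y leaves the source untouched and makes the helper return a non-zero status.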

crawl link-status

Show pending graph linking work per source. Use this to see how many crawled chunks still need to be linked into the graph.

bash
sibyl crawl link-status [--json]

crawl link-graph

Link crawled chunks into the graph. Pass a source ID, or all to process every source.

bash
sibyl crawl link-graph [source_id] [options]
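
| Argument | Required | Description |
| --- | --- | --- |
| source_id | No | Source ID, or all for all sources |

| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --batch | -b | 50 | Batch size |
| --dry-run | -n | false | Show what would be processed |
| --create-new | | false | Create graph entities for unlinked extractions |
| --json | -j | false | JSON output |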

Examples

bash
# Dry-run linking for one source
sibyl crawl link-graph abc123 --dry-run

# Link all sources, creating entities for new extractions
sibyl crawl link-graph all --create-new

crawl documents

Browse crawled documents.

crawl documents list

List crawled documents.

bash
sibyl crawl documents list [options]

| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --source | -s | (all) | Filter by source ID |
| --limit | -n | 20 | Max results |
| --json | -j | false | JSON output |
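
As an illustration of combining these options, the sketch below wraps the command in a shell function. docs_for_source is a hypothetical name, not part of sibyl, and abc123 is a placeholder source ID; the structure of the JSON output is not described on this page, so the sketch does not parse it.

```shell
# Hypothetical helper (not part of sibyl): list documents for one source as JSON.
# $1: source ID (placeholder, e.g. abc123)
# $2: optional maximum number of results; falls back to the documented default of 20
docs_for_source() {
  sibyl crawl documents list --source "$1" --limit "${2:-20}" --json
}
```

For example, docs_for_source abc123 5 would request at most five documents from source abc123.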

crawl documents show

Show full document content. Use the document_id from search result metadata.

bash
sibyl crawl documents show <document_id> [options]

| Argument | Required | Description |
| --- | --- | --- |
| document_id | Yes | Document ID from search result metadata |

| Option | Short | Description |
| --- | --- | --- |
| --raw | -r | Show raw markdown content |
| --json | -j | JSON output |

Example

bash
sibyl search "proto config"
# note the document_id in result metadata
sibyl crawl documents show 22d4cf79-8561-4be0-8067-da8673e3439d

Released under the Apache-2.0 License.