Class: Crawler
Defined in: src/kermit/Crawler.coffee
Inherits: Mixin
Overview
The Crawler coordinates execution of submitted RequestItems by applying all Extensions matching the item's current phase.
All functionality for item handling, such as filtering, queueing, streaming, storing, logging etc., is implemented as Extensions to ExtensionPoints.
Extensions are added to extension points during initialization. Core extensions are added automatically; user extensions are specified in the options of the Crawler's constructor.
The crawler defines an extension point for each distinct value of ProcessingPhase. Each ExtensionPoint wraps the processing steps carried out when the item phase changes to a new value. The phase transitions implicitly define an item flow, illustrated in the diagram below.
Examples:
Configuration Parameters
name      : "kermit"
basedir   : "/tmp/sloth"
extensions: [] # Clients can add extensions
options   : # Options of each core extension can be customized here
  Logging   : LogConfig.detailed
  Queueing  : {} # Options for the queuing system, see [QueueWorker] and [QueueConnector]
  Streaming : {} # Options for the [Streamer]
  Filtering : {} # Options for item filtering, [RequestFilter], [DuplicatesFilter]
  Scheduling: # Options for the [Scheduler]
    maxWaiting: 50
    msPerUrl  : 50
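For illustration, a hedged sketch of passing such options to the constructor; the values shown (name, basedir, msPerUrl) are arbitrary, and the assumption that omitted options fall back to the defaults above is an inference, not a statement from this page:

# Sketch: construct a crawler, overriding a few of the defaults listed above
# (assumption: omitted options keep the default values shown).
crawler = new Crawler
  name   : "demo"
  basedir: "/tmp/demo"
  options:
    Scheduling:
      msPerUrl: 200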
Instance Method Summary
- # (void) start() Start crawling.
- # (void) stop(done) Stop crawling.
- # (void) shutdown() Stop crawling and exit process.
- # (void) on(event, handler) Register a handler for the given event.
- # (RequestItem) crawl(url, meta) Create a new RequestItem and start its processing.
- # (void) schedule(url, meta) Add the url to the Scheduler.
- # (void) execute(command)
- # (void) toString() Pretty print this crawler.
Constructor Details
# (void) constructor(options = {})
Create a new crawler with the given options.
Instance Method Details
# (void) start()
Start crawling. All queued commands will be executed after the "commands.start" message has been sent to all listeners.
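A hedged sketch of the start sequence; reading "queued commands" as including crawl calls issued before start is an interpretation, not an explicit statement here:

# Sketch: commands issued before start() are queued and run once the
# "commands.start" message has been sent to all listeners.
crawler = new Crawler()
crawler.crawl "http://example.com", {}   # assumed: queued until start
crawler.start()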
# (void) stop(done)
Stop crawling. Under normal operation, unfinished RequestItems will be brought into one of the terminal phases COMPLETE, CANCELED, or ERROR. The UrlScheduler, the QueueWorker, and all other extensions will receive the "commands.stop" message. The QueueSystem will be persisted, then the optional callback will be invoked.
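A hedged sketch of stopping with the optional callback; the callback body is illustrative only:

# Sketch: the callback runs after the QueueSystem has been persisted.
crawler.stop -> console.log "all items terminal, queue persisted"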
# (void) shutdown()
Stop crawling and exit the process.
# (void) on(event, handler)
Register a handler to be invoked when the given event is emitted by the crawler.
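A hedged sketch of registering a handler; the assumption that the "commands.start" / "commands.stop" messages mentioned under start() and stop() are also delivered to handlers registered this way is an inference:

# Sketch: react to crawler lifecycle messages.
crawler.on "commands.stop", -> console.log "crawler is shutting down"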
# (RequestItem) crawl(url, meta)
Create a new RequestItem and start its processing.
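A hedged sketch of submitting a url directly; the meta key shown (depth) is hypothetical and only illustrates attaching metadata to the item:

# Sketch: crawl returns the created RequestItem (per the signature above).
item = crawler.crawl "http://example.com", depth: 0   # "depth" is a hypothetical meta field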
# (void) schedule(url, meta)
Add the url to the Scheduler.
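A hedged sketch contrasting schedule with crawl; the reading that a scheduled url is processed later, when the Scheduler releases it, is inferred from the Scheduler's role and not stated explicitly here:

# Sketch: hand the url to the Scheduler instead of creating a RequestItem right away.
crawler.schedule "http://example.com/news", {}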
# (void) execute(command)
# (void) toString()
Pretty print this crawler.