Class: Crawler

Defined in: src/kermit/Crawler.coffee
Inherits: Mixin

Overview

The Crawler coordinates the execution of submitted RequestItems by applying all Extensions that match an item's current phase.

All functionality for item handling, such as filtering, queueing, streaming, storing, and logging, is implemented as Extensions to ExtensionPoints.

Extensions are added to extension points during initialization. Core extensions are added automatically; user extensions are specified in the options of the Crawler's constructor.

The crawler defines an extension point for each distinct value of ProcessingPhase. Each ExtensionPoint wraps the processing steps carried out when the item's phase changes to a new value. The phase transitions thus implicitly define the flow of items through the crawler.
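
For illustration, a custom extension hooking into a single phase might look like the following sketch. Note that this is a hypothetical example: the require path, the handler registration convention, and the item accessor are assumptions, not the verified kermit API.

  {Extension} = require 'kermit/Extension' # require path assumed

  # Hypothetical extension that logs each item entering the INITIAL phase
  class UrlLogger extends Extension
    constructor: ->
      # Assumed convention: handlers are registered per ProcessingPhase value
      super INITIAL: (item) -> console.log "New item for #{item.url()}" # item.url() assumed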

Examples:

Configuration Parameters

  name      : "kermit"
  basedir   : "/tmp/sloth"
  extensions: [] # Clients can add extensions
  options   : # Options of each core extension can be customized here
    Logging : LogConfig.detailed
    Queueing   : {} # Options for the queuing system, see [QueueWorker] and [QueueConnector]
    Streaming: {} # Options for the [Streamer]
    Filtering  : {} # Options for item filtering, [RequestFilter],[DuplicatesFilter]
    Scheduling : {} # Options for the [Scheduler]
      maxWaiting: 50
      msPerUrl: 50
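
Putting it together, a crawler might be created and run as in the following sketch (the require path is an assumption; the option names follow the example above):

  {Crawler} = require 'kermit' # require path assumed

  crawler = new Crawler
    name      : "kermit"
    basedir   : "/tmp/sloth"
    extensions: [new UrlLogger()] # hypothetical extension from the overview
    options   :
      Scheduling:
        maxWaiting: 50
        msPerUrl  : 50

  crawler.start()
  crawler.crawl "http://example.com"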

Events:

@commands.start

Fired when crawling is started

@commands.stop

Fired when crawling is stopped
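
Clients can subscribe to these events via #on, for example:

  crawler.on "commands.start", -> console.log "crawling started"
  crawler.on "commands.stop",  -> console.log "crawling stopped"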

Instance Method Summary

  • (void) start() Start crawling
  • (void) stop(done) Stop crawling, persisting the QueueSystem
  • (void) shutdown() Stop crawling and exit the process
  • (void) on(event, handler) Register a handler for a crawler event
  • (RequestItem) crawl(url, meta) Create a new RequestItem and start its processing
  • (void) schedule(url, meta) Add the url to the Scheduler
  • (void) execute(command) Execute or queue a command
  • (void) toString() Pretty print this crawler

Constructor Details

# (void) constructor(options = {})

Create a new crawler with the given options.

Parameters:

  • options ( Object ) The configuration for this crawler.

Instance Method Details

# (void) start()

Start crawling. All queued commands will be executed after the "commands.start" message has been sent to all listeners.

# (void) stop(done)

Stop crawling. Unfinished RequestItems are brought into a terminal phase (COMPLETE, CANCELED, or ERROR) through normal operation. The UrlScheduler, the QueueWorker, and all other extensions receive the "commands.stop" message. The QueueSystem is persisted, then the optional callback done is invoked.
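
For example, to run cleanup only after the QueueSystem has been persisted:

  crawler.stop ->
    console.log "QueueSystem persisted, crawler stopped"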

# (void) shutdown()

Stop crawling and exit the process.

# (void) on(event, handler)

Register a handler for the given crawler event.

# (RequestItem) crawl(url, meta)

Create a new RequestItem and start its processing.

Returns:

  • ( RequestItem ) The newly created item
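
A usage sketch (the structure of meta is not specified here; the key shown is purely illustrative):

  item = crawler.crawl "http://example.com", tag: "landing-page"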

# (void) schedule(url, meta)

Add the url to the Scheduler.
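
In contrast to #crawl, which starts processing immediately, #schedule hands the url to the Scheduler, presumably for crawling subject to its maxWaiting and msPerUrl limits:

  crawler.schedule "http://example.com/news", tag: "listing" # meta keys illustrative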

# (void) execute(command)

Execute the given command, or queue it for execution when the crawler is started (see #start).

# (void) toString()

Pretty print this crawler.
