RDP Data Crawler

Periodically fetches publicly available forecast and measurement information and stores it into Redis streams

View the Project on GitHub AIT-RDP/rdp-data-crawler

The RDP Data Crawler interfaces between external systems and the AIT RDP. It fetches data, either periodically or driven by external events, from various sources such as forecast or measurement services and stores the data in Redis streams. In addition, the RDP Data Crawler can push data to external systems such as Modbus or OPC UA devices.

Installation and System Integration

The RDP Data Crawler is designed to be integrated into the AIT RDP as a Docker container. The main Docker image is available on Docker Hub as ait1/rdp-data-crawler. In addition to version tags, the following are supported:

Since the configurations are typically rather complex, direct configuration via environment variables is not feasible. Instead, a configuration file or directory is mounted. By default, the configuration is located at /etc/data_crawler/config.yml. Alternatively, the whole /etc/data_crawler/ directory can be mounted in case sub-configuration files are needed. The following example shows a basic docker-compose service definition:

services:
  # ...
  data-crawler:
    image: ait1/rdp-data-crawler:latest-dev
    volumes:
      - ./data-crawler/config.yml:/etc/data_crawler/config.yml:ro
    environment:
      REDIS_USERNAME: ${REDIS_USERNAME}
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    depends_on:
      - redis
    restart: unless-stopped
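
If sub-configuration files are used, the whole /etc/data_crawler/ directory can be mounted instead of a single file. A minimal sketch of the alternative volume mapping (the local ./data-crawler/ path is an assumption from the example above):

```yaml
services:
  data-crawler:
    image: ait1/rdp-data-crawler:latest-dev
    volumes:
      # Mount the whole configuration directory so that sub-configuration
      # files next to config.yml are visible inside the container as well.
      - ./data-crawler/:/etc/data_crawler/:ro
```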

For other installation methods, including custom data sources and development setups, please refer to the Advanced Installation section.

Configuration Structure

The configuration is done via a YAML file that supports the AIT RDP extensions such as variable substitution and templating. The configuration is organized into multiple channels, each described from the perspective of its data source. For each channel, a single source is created, and the data from that source is accessed, processed, and written to a destination. In most cases, either the destination (the default) or the source will be an internal Redis instance, but other configurations are supported as well. A single data crawler process can therefore handle multiple sources concurrently, which reduces the overhead of spinning up a large number of containers. The basic structure of the configuration is as follows:

version: 1  # Configuration file version, mostly to ensure later compatibility

# List of sources that will spin up a dedicated channel for each source
data sources:
  source.name.0:  # Unique name of the channel. Mostly used for debugging
    type: <source_type>  # Type of the source to instantiate
    source parameter: {}  # Source-specific parameters are defined here
    # The polling section describes the timing of passive data sources that are polled periodically. Active data
    # sources that listen for external events may emit messages at any time; hence, the polling section can be
    # omitted for these sources.
    polling:
      frequency: 5min

    sink_type: <sink_type>  # Type of the sink
    sink_parameters: {}  # Sink-specific parameters are defined here

  source.name.1: # Another source
    # ...
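
To make the structure concrete, a hypothetical channel might look as follows. Note that the source type example_rest_api, its url parameter, the sink type name, and the stream name are purely illustrative placeholders, not actual identifiers shipped with the distribution:

```yaml
version: 1
data sources:
  weather.forecast:  # Debug-friendly channel name
    type: example_rest_api  # Hypothetical source type
    source parameter:
      url: https://example.org/forecast  # Hypothetical source-specific parameter
    polling:
      frequency: 15min
    sink_type: redis_stream  # Hypothetical sink type
    sink_parameters:
      stream: forecast.weather  # Hypothetical sink-specific parameter
```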

Each section has at least a source and some sink configuration. Both sources and sinks are dynamically loaded by their respective type. Source- and sink-specific parameters can be passed in the source parameter and sink_parameters sections, respectively. In case a Redis sink is used, the configuration can be simplified by omitting the sink_type and sink_parameters sections and appending a redis section instead:

version: 1
data sources:
  source.name.0:
    type: <source_type>
    source parameter: {}
    polling:
      frequency: 5min
    # The Redis section replaces the sink_type and sink_parameters sections.
    redis:
      stream: <stream_name>  # Name of the Redis stream to write to.
      tags:  # Optional tags to be added to the Redis stream.
        message-key-0: message-value-0
        message-key-1: message-value-1

Note that the redis sink allows appending arbitrary but static message fields. This can be used to set metadata that is needed for further data processing.
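
Conceptually, the sink merges the configured static tags into every entry it appends to the stream. A minimal Python sketch of that merge (the field names and the precedence rule are assumptions for illustration, not the actual sink implementation):

```python
def build_stream_entry(measurement: dict, tags: dict) -> dict:
    """Merge static tags into a single Redis stream entry.

    Measurement fields take precedence over tags, so a misconfigured
    tag cannot overwrite actual data (an assumption about the real
    sink's behaviour).
    """
    return {**tags, **measurement}

entry = build_stream_entry(
    {"value": "42.5", "timestamp": "2024-01-01T00:00:00Z"},
    {"message-key-0": "message-value-0", "message-key-1": "message-value-1"},
)
```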

Polling Configuration

Sources that require regular polling can be configured via the polling section. Fine-grained control of the timing is possible in order to avoid overloading the various sources. At minimum, the frequency parameter must be set. The complete list of parameters is as follows:

In case the processing duration exceeds the configured frequency, the next operation is scheduled immediately. If the delay also exceeds the following regular interval, that triggering point is skipped in order to avoid a pile-up of delays and unpredictable timing.
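
The overrun handling described above can be sketched as a simple scheduling rule: triggers lie on a fixed time grid, a trigger missed by less than one interval runs immediately, and older missed triggers are skipped. This is a simplified model of the behaviour, not the crawler's actual scheduler:

```python
def next_trigger(last_run: float, frequency: float, now: float) -> float:
    """Return the next execution time, given the previous run's start time.

    Triggers lie on the grid last_run + k * frequency. A trigger missed
    by less than one interval runs immediately; triggers missed by more
    are skipped and the schedule realigns to the grid.
    """
    due = last_run + frequency
    if now <= due:
        return due  # On time: wait for the regular grid point.
    if now < due + frequency:
        return now  # Overran one interval: schedule immediately.
    # Delay exceeded the following interval: skip the missed points
    # and realign to the next future grid point.
    intervals = int((now - last_run) // frequency)
    return last_run + (intervals + 1) * frequency

# Polling every 300 s, last run started at t=0:
# finished at t=100 -> next run at the regular t=300;
# finished at t=450 -> trigger at t=300 was missed, run immediately;
# finished at t=730 -> triggers at t=300/600 are skipped, next run at t=900.
```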

Data Sources and Data Sinks

The default distribution of the AIT RDP Data Crawler already supports a broad variety of data sources and data sinks. The following overview lists the main ones. Detailed configurations can be found in the data source and data sink descriptions.