Groot Stream Platform
Groot Stream Platform is a distributed data integration and synchronization tool that helps you process netflow data (logs, metrics, etc.) in real time with high reliability and high performance.
Table of contents
- Features
- Groot Stream Workflow
- Supported Connectors & Processors & Functions
- Minimum Requirements
- Getting Started
- Documentation
Features
Groot Stream is designed to simplify ETL (Extract, Transform, Load) operations. It efficiently collects data from multiple sources, then processes and enriches it.
- Real-time data processing: Using Flink as the execution engine, it provides high-throughput, low-latency processing for large-scale data streams.
- Designed for extension: Plugin-based management that supports user-defined Functions, Sources, and Sinks.
- Highly configurable: Customize the data flow through YML templates to swiftly fulfill ETL requirements without writing code.
- Out-of-the-box Functions: Built-in functions for data processing, including data type conversion, data filtering, data aggregation, and data enrichment.
Groot Stream Workflow
To configure a job, you set up Sources, Filters, a Processing Pipeline, and Sinks, and assemble several built-in functions into the Processing Pipeline. The job is then deployed to a Flink cluster for execution.
- Source: The data source of the job, which can be a Kafka topic, an IPFIX Collector, or a file.
- Filter: Filters data based on specified conditions.
- Types of Pipelines: The fundamental unit of data stream processing is the processor, categorized by functionality into stateless and stateful processors. Each processor can assemble UDFs (user-defined functions) into a pipeline. There are 3 types of pipelines at different stages of data processing:
  - Pre-processing Pipeline: Optional. Attached to a source to normalize events before they enter the processing pipeline.
  - Processing Pipeline: Event processing pipeline.
  - Post-processing Pipeline: Optional. Attached to a sink to normalize events before they're written to the sink.
- Sink: The data sink of the job, which can be a Kafka topic, a ClickHouse table, or a file.
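As a rough mental model of the workflow above, the stages can be pictured as a Unix pipeline in which each stage reads events from the previous one. This is only an analogy written in plain shell: none of the function names are Groot Stream APIs, the sample events are made up, and placing the filter after the pre-processing stage is an assumption made for the sketch.

```shell
#!/bin/sh
# Conceptual analogy only: every stage reads stdin and writes stdout.
# All names and sample events here are hypothetical, not Groot Stream APIs.

source_stage() {    # Source: emit raw events (hard-coded sample lines)
    printf '%s\n' ' tcp 443 ' 'udp 53' ' tcp 80 '
}
pre_process() {     # Pre-processing pipeline: trim surrounding spaces
    sed 's/^ *//; s/ *$//'
}
filter_stage() {    # Filter: keep only events that match a condition
    grep '^tcp '
}
process_stage() {   # Processing pipeline: enrich each event with a new field
    awk '{ print $0, "proto=" toupper($1) }'
}
post_process() {    # Post-processing pipeline: reshape events for the sink
    tr ' ' ','
}
sink_stage() {      # Sink: write the results out (stdout in this sketch)
    cat
}

run_job() {
    source_stage | pre_process | filter_stage | process_stage | post_process | sink_stage
}

run_job
```

Running the sketch prints `tcp,443,proto=TCP` and `tcp,80,proto=TCP`; the `udp` event is dropped by the filter stage.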
Supported Connectors & Processors & Functions
Minimum Requirements
- Git installed
- Java (JDK/JRE 11 is required) installed and JAVA_HOME set
- Maven 3.5.4
- Scala 2.12
- Flink 1.13.1
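Before building, it can help to confirm that the basic tools are in place. The following is a minimal sketch, not part of the project: `have_cmd` and `check_prereqs` are hypothetical helper names, and the script only checks that `git`, `java`, and `JAVA_HOME` are present. It does not verify any of the versions listed above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: report presence of git, java, and JAVA_HOME.
# Returns 0 when everything is present, 1 otherwise. Versions are not checked.
check_prereqs() {
    missing=0
    for tool in git java; do
        if have_cmd "$tool"; then
            echo "ok: $tool found"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    if [ -n "${JAVA_HOME:-}" ]; then
        echo "ok: JAVA_HOME=$JAVA_HOME"
    else
        echo "missing: JAVA_HOME is not set"
        missing=1
    fi
    return "$missing"
}

check_prereqs || echo "Install the missing prerequisites before building."
```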
Getting Started
Building
Run the following Maven command to build the project modules using parallel threads:
./mvnw clean install -T2C
Run the following Maven command to build the project modules and skip the tests:
./mvnw clean install -DskipTests
Deploying
1. Download the release package
Download the latest release package from the Releases page.
Copy the groot-release/target/groot-stream-${version}-bin.tar.gz file to the target machine and extract it:
tar -zxvf groot-stream-${version}-bin.tar.gz
ls -lh groot-stream-${version}
2. Configure the environment
Configure the Flink engine environment variables in the config/grootstream-env.sh file. By default, the system environment variables are used. If a variable is not set, the following default value applies:
FLINK_HOME=${FLINK_HOME:-/opt/flink}
FLINK_JOB_MANAGER_ADDRESS=${FLINK_JOB_MANAGER_ADDRESS:-localhost:8081}
YARN_ADDRESS=${YARN_ADDRESS:-yarn-cluster}
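The `${VAR:-default}` form in these lines is standard shell parameter expansion: an existing non-empty value is kept, and the fallback after `:-` is used otherwise. A minimal sketch of that behaviour follows; `/usr/local/flink` is a made-up override path chosen for illustration, and the `*_result` variable names are not part of Groot Stream.

```shell
#!/bin/sh
# Demonstrates the ${VAR:-default} fallback used in config/grootstream-env.sh.

# Case 1: the variable is unset, so the default is used.
unset FLINK_HOME
unset_result=${FLINK_HOME:-/opt/flink}

# Case 2: the variable is already set, so the existing value is kept.
# /usr/local/flink is a hypothetical path, not a Groot Stream default.
FLINK_HOME=/usr/local/flink
set_result=${FLINK_HOME:-/opt/flink}

# Case 3: an empty value also falls back, because ":-" treats empty like unset.
FLINK_HOME=""
empty_result=${FLINK_HOME:-/opt/flink}

echo "unset -> $unset_result"
echo "set   -> $set_result"
echo "empty -> $empty_result"
```

To override a default for a single run, export the variable in the calling shell before starting a job, for example `export FLINK_HOME=/usr/local/flink`.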
3. Configure the groot-stream job
Configure the groot-stream job in the config/grootstream_job_example.yaml file. For more information about the configuration, see config concept.
4. Submit a job to the Flink engine
The job can be started as a daemon with -d.
./bin/start.sh -c *.yaml -d
Starting
Running job in your IDE
- Set the groot-bootstrap module pom.xml scope to compile.
- Open the Run/Debug Configurations window.
- Choose -cp groot-bootstrap.
- Choose Main Class com.geedgenetworks.bootstrap.main.GrootStreamServer.
- Add VM options --target local -c /...../groot-stream/config/grootstream_job_example.yaml.
- Click the Run button.
Running the CLI
- Run the following command to start the groot-stream server for Standalone Mode:
cd "groot-stream-${version}"
./bin/start.sh -c ./config/grootstream_job_example.yaml --target remote -n inline-to-print-job -d
- Run the following command to start the groot-stream server for Yarn Session Mode:
# First create a yarn session cluster
yarn-session.sh -d
# Then start the groot-stream server for Yarn Session Mode.
cd "groot-stream-${version}"
./bin/start.sh -c ./config/grootstream_job_example.yaml --target yarn-session -Dyarn.application.id=application_XXXX_YY -n inline-to-print-job -d
- Run the following command to start the groot-stream server for Yarn Per-job Mode:
cd "groot-stream-${version}"
./bin/start.sh -c ./config/grootstream_job_example.yaml --target yarn-per-job -Dyarn.application.name="inline-to-print-job" -Djobmanager.memory.process.size=1024m -Dtaskmanager.memory.process.size=2048m -Dtaskmanager.numberOfTaskSlots=3 -p 6 -n inline-to-print-job -d
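Long Yarn Per-job command lines are easy to mistype, so one option is to assemble the arguments one per line and review the result before running it. The sketch below is a dry run that only prints the command. It uses the same options as the Yarn Per-job command in this section, with every `-D` property written with its leading dash, and `yarn_per_job_cmd` is a hypothetical helper name rather than anything shipped with Groot Stream.

```shell
#!/bin/sh
# Hypothetical helper: print the Yarn Per-job start command for a job name.
# It does not execute anything; it only echoes the assembled command line.
yarn_per_job_cmd() {
    job_name=$1
    # Replace the positional parameters with the option list, one per line.
    set -- \
        -c ./config/grootstream_job_example.yaml \
        --target yarn-per-job \
        -Dyarn.application.name="$job_name" \
        -Djobmanager.memory.process.size=1024m \
        -Dtaskmanager.memory.process.size=2048m \
        -Dtaskmanager.numberOfTaskSlots=3 \
        -p 6 \
        -n "$job_name" \
        -d
    echo "./bin/start.sh $*"
}

yarn_per_job_cmd inline-to-print-job
```

Once the printed command looks right, it can be copied into the terminal from inside the extracted release directory.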
Configuring
The User Guide provides detailed information on how to configure a job.
Documentation
See the Groot Stream Documentation for more information.
Contributors
For all developers, see the list of contributors here.
