No description
This repository has been archived on 2026-06-16. You can view files and clone it, but you cannot make any changes to its state, such as pushing and creating new issues, pull requests or comments.
  • Java 99.6%
  • Shell 0.4%
Find a file
2024-11-26 19:29:46 +08:00
.mvn/wrapper support mvnw command 2024-11-14 09:58:56 +08:00
bin [Improve][bootstrap] Inprove install cn udf script. Add Rate Limiting test case. 2024-05-23 14:17:00 +08:00
config [feature][core]CN-1730 修改函数类名为ArrayConcatAgg 2024-11-26 18:39:28 +08:00
develop chore: add JavaStyle and FindBugs related plugins and configs 2024-01-04 15:31:40 +08:00
docs Add ARRAY_CONCAT_AGG aggregate function desc 2024-11-25 13:40:37 +08:00
groot-api [improve][bootstrap]修改原UDFUtil工具类位置,后期加载udf函数方式待优化 2024-11-19 16:59:37 +08:00
groot-bootstrap [feature][core]CN-1730 回滚CollectList和CollectSet,新增数组聚合函数ArrayContactAgg及相关单元测试 2024-11-26 18:13:48 +08:00
groot-common [feature][core]CN-1730 修改函数类名为ArrayConcatAgg 2024-11-26 18:39:28 +08:00
groot-connectors [Feature][api] AviatorFilterProcessorFactory renamed to FilterProcessorFactory.The Factory add supportsType method for supporting legency type of avaitor. 2024-11-14 09:54:50 +08:00
groot-core Test method remove "throws Exception". 2024-11-26 19:29:46 +08:00
groot-examples [Improve][e2e] Rename all e2e test modules to adapt to changes in the API operators. 2024-11-23 19:23:16 +08:00
groot-formats [Improve][format] format添加测试 2024-11-18 18:15:40 +08:00
groot-release [Feature][SPI] 增加groot-spi模块,解耦core和common模块之间的复杂依赖关系,移除一些不需要的类库。 2024-11-09 20:01:24 +08:00
groot-shaded [Improve][Core] Knowledge Handler use http client shaded library. 2024-07-24 11:52:22 +08:00
groot-tests [Improve][e2e] Rename all e2e test modules to adapt to changes in the API operators. 2024-11-23 19:23:16 +08:00
plugins [Improve][bootstrap] Inprove install cn udf script. Add Rate Limiting test case. 2024-05-23 14:17:00 +08:00
.gitignore update .gitignore 2024-01-29 15:09:05 +08:00
.gitlab-ci.yml Update .gitlab-ci.yml 2024-07-16 11:25:08 +00:00
mvnw [improve][bootstrap] support mvnw build project 2023-12-13 18:08:56 +08:00
plugin-mapping.properties [tests][e2e-clickhouse] Support the ingestion of common data types by a Flink job 2024-08-11 23:31:30 +08:00
pom.xml [Improve][core] Preprocessing 和 postprocessing 标识已过期,后续任务将被移除。修复了 Split side output 下游节点存在其他边无法正确构建拓扑的问题。 2024-11-16 00:26:30 +08:00
README.md [docs][udf] Update scalar UDFs, user-defined aggregate functions (UDAFs) description. 2024-08-15 16:18:41 +08:00

Groot Stream Platform

Groot Stream Platform helps you process netflow data - logs, metrics etc. - in real time, high reliability and high performance, distributed data integration and synchronization tool.

Table of contents

Features

Groot Stream is designed to simplify the operation of ETL (Extract, Transform, Load). It efficiently collects data from multiple sources and processes and enriches it.
  • Real-time data processing: Using Flink as the execution engine, it can provide high throughput and low-latency processing capabilities for large-scale data streams.
  • Designed for extension: Plugin-based management that support for User-defined Functions, Sources, and Sinks.
  • Highly Configurable: Customize data flow through YML templates to swiftly fulfill ETL requirements without development.
  • Out-of-the-box Functions: Built-in functions for data processing, including data type conversion, data filtering, data aggregation, and data enrichment.

Groot Stream Workflow

Groot Stream Workflow

Configure a job, you'll set up Sources, Filters, Processing Pipeline, and Sinks, and will assemble several built-in functions into a Processing Pipeline. The job will then be deployed to a Flink cluster for execution.

  • Source: The data source of the job, which can be a Kafka topic, a IPFIX Collector, or a file.
  • Filter: Filters data based on specified conditions.
  • Types of Pipelines: The fundamental unit of data stream processing is the processor, categorized by functionality into stateless and stateful processors. Each processor can be assemble UDFs(User-defined functions) into a pipeline. There are 3 types of pipelines at different stages of the data processing process:
    • Pre-processing Pipeline: Optional. These pipelines that are attached to a source to normalize the events before they enter the processing pipeline.
    • Processing Pipeline: Event processing pipeline.
    • Post-processing Pipeline: Optional. These pipelines that are attached to a sink to normalize the events before they're written to the sink.
  • Sink: The data sink of the job, which can be a Kafka topic, a ClickHouse table, or a file.

Supported Connectors & Processors & Functions

Minimum Requirements

  • Git installed
  • JAVAJDK/JRE11 are requiredinstalled and JAVA_HOME set
  • Maven 3.5.4
  • Scala 2.12
  • Flink 1.13.1

Getting Started

Building

Run the following Maven command to build the project modules using parallel threads:

./mvnw clean install -T2C

Run the following Maven command to build the project modules and Skip Tests:

./mvnw clean install -DskipTests

Deploying

1.Download the release package

Download the latest release package from the Releases. Copy the groot-release/target/groot-stream-${version}-bin.tar.gz file to the target machine and extract it:

tar -zxvf groot-stream-${version}-bin.tar.gz
ls -lh groot-stream-${version}

2. Configure the environment

You need to configure Flink engine environment variables in config/grootstream-env.sh file.Default will use system environment variables. If not set, it will use the default value for the following variables:

FLINK_HOME=${FLINK_HOME:-/opt/flink}
FLINK_JOB_MANAGER_ADDRESS=${FLINK_JOB_MANAGER_ADDRESS:-localhost:8081}
YARN_ADDRESS=${YARN_ADDRESS:-yarn-cluster}

3. Configure the groot-stream job

You need to configure the groot-stream job in config/grootstream_job_example.yaml file. More information about config please check config concept

Can be started by a daemon with -d.

  ./bin/start.sh -c *.yaml -d

Starting

Running job in your IDE

  1. Set groot-bootstrap module pom.xml scope to compile.
  2. Open the Run/Debug Configurations window.
  3. Choose -cp groot-bootstrap
  4. Choose Main Class com.geedgenetworks.bootstrap.main.GrootStreamServer.
  5. Add VM options --target local -c /...../groot-stream/config/grootstream_job_example.yaml.
  6. Click the Run button.

Running the CLI

  • Run the following command to start the groot-stream server for Standalone Mode:
cd "groot-stream-${version}"
./bin/start.sh -c ./config/grootstream_job_example.yaml --target remote -n inline-to-print-job -d
  • Run the following command to start the groot-stream server for Yarn Session Mode:
# First create a yarn session cluster 
yarn-session.sh -d  
# Then start the groot-stream server for Yarn Session Mode. 
cd "groot-stream-${version}"
./bin/start.sh -c ./config/grootstream_job_example.yaml --target yarn-session -Dyarn.application.id=application_XXXX_YY -n inline-to-print-job -d
  • Run the following command to start the groot-stream server for Yarn Per-job Mode:
cd "groot-stream-${version}"
./bin/start.sh -c ./config/grootstream_job_example.yaml --target yarn-per-job -Dyarn.application.name="inline-to-print-job" Djobmanager.memory.process.size=1024m -Dtaskmanager.memory.process.size=2048m -Dtaskmanager.numberOfTaskSlots=3  -p 6  -n inline-to-print-job -d

Configuring

The User Guide provides detailed information on how to configure a job.

Documentation

See the Groot Stream Documentation for more information.

Contributors

All developers see the list of contributors here.