Archived

No description

This repository has been archived on 2026-06-16. You can view files and clone it, but you cannot make any changes to its state, such as pushing and creating new issues, pull requests or comments.

Java 99.6%
Shell 0.4%

Find a file

doufenghu 64a4ffb63a Test method remove "throws Exception".		2024-11-26 19:29:46 +08:00
.mvn/wrapper	support mvnw command	2024-11-14 09:58:56 +08:00
bin	[Improve][bootstrap] Inprove install cn udf script. Add Rate Limiting test case.	2024-05-23 14:17:00 +08:00
config	[feature][core]CN-1730 修改函数类名为ArrayConcatAgg	2024-11-26 18:39:28 +08:00
develop	chore: add JavaStyle and FindBugs related plugins and configs	2024-01-04 15:31:40 +08:00
docs	Add ARRAY_CONCAT_AGG aggregate function desc	2024-11-25 13:40:37 +08:00
groot-api	[improve][bootstrap]修改原UDFUtil工具类位置，后期加载udf函数方式待优化	2024-11-19 16:59:37 +08:00
groot-bootstrap	[feature][core]CN-1730 回滚CollectList和CollectSet,新增数组聚合函数ArrayContactAgg及相关单元测试	2024-11-26 18:13:48 +08:00
groot-common	[feature][core]CN-1730 修改函数类名为ArrayConcatAgg	2024-11-26 18:39:28 +08:00
groot-connectors	[Feature][api] AviatorFilterProcessorFactory renamed to FilterProcessorFactory.The Factory add supportsType method for supporting legency type of avaitor.	2024-11-14 09:54:50 +08:00
groot-core	Test method remove "throws Exception".	2024-11-26 19:29:46 +08:00
groot-examples	[Improve][e2e] Rename all e2e test modules to adapt to changes in the API operators.	2024-11-23 19:23:16 +08:00
groot-formats	[Improve][format] format添加测试	2024-11-18 18:15:40 +08:00
groot-release	[Feature][SPI] 增加groot-spi模块，解耦core和common模块之间的复杂依赖关系，移除一些不需要的类库。	2024-11-09 20:01:24 +08:00
groot-shaded	[Improve][Core] Knowledge Handler use http client shaded library.	2024-07-24 11:52:22 +08:00
groot-tests	[Improve][e2e] Rename all e2e test modules to adapt to changes in the API operators.	2024-11-23 19:23:16 +08:00
plugins	[Improve][bootstrap] Inprove install cn udf script. Add Rate Limiting test case.	2024-05-23 14:17:00 +08:00
.gitignore	update .gitignore	2024-01-29 15:09:05 +08:00
.gitlab-ci.yml	Update .gitlab-ci.yml	2024-07-16 11:25:08 +00:00
mvnw	[improve][bootstrap] support mvnw build project	2023-12-13 18:08:56 +08:00
plugin-mapping.properties	[tests][e2e-clickhouse] Support the ingestion of common data types by a Flink job	2024-08-11 23:31:30 +08:00
pom.xml	[Improve][core] Preprocessing 和 postprocessing 标识已过期，后续任务将被移除。修复了 Split side output 下游节点存在其他边无法正确构建拓扑的问题。	2024-11-16 00:26:30 +08:00
README.md	[docs][udf] Update scalar UDFs, user-defined aggregate functions (UDAFs) description.	2024-08-15 16:18:41 +08:00

README.md

Groot Stream Platform

Groot Stream Platform helps you process netflow data - logs, metrics etc. - in real time, high reliability and high performance, distributed data integration and synchronization tool.

Features
Groot Stream Workflow
Supported Connectors & Functions
Minimum Requirements
Getting Started
Documentation

Features

Groot Stream is designed to simplify the operation of ETL (Extract, Transform, Load). It efficiently collects data from multiple sources and processes and enriches it.

Real-time data processing: Using Flink as the execution engine, it can provide high throughput and low-latency processing capabilities for large-scale data streams.
Designed for extension: Plugin-based management that support for User-defined Functions, Sources, and Sinks.
Highly Configurable: Customize data flow through YML templates to swiftly fulfill ETL requirements without development.
Out-of-the-box Functions: Built-in functions for data processing, including data type conversion, data filtering, data aggregation, and data enrichment.

Groot Stream Workflow

Configure a job, you'll set up Sources, Filters, Processing Pipeline, and Sinks, and will assemble several built-in functions into a Processing Pipeline. The job will then be deployed to a Flink cluster for execution.

Source: The data source of the job, which can be a Kafka topic, a IPFIX Collector, or a file.
Filter: Filters data based on specified conditions.
Types of Pipelines: The fundamental unit of data stream processing is the processor, categorized by functionality into stateless and stateful processors. Each processor can be assemble UDFs(User-defined functions) into a pipeline. There are 3 types of pipelines at different stages of the data processing process:
- Pre-processing Pipeline: Optional. These pipelines that are attached to a source to normalize the events before they enter the processing pipeline.
- Processing Pipeline: Event processing pipeline.
- Post-processing Pipeline: Optional. These pipelines that are attached to a sink to normalize the events before they're written to the sink.
Sink: The data sink of the job, which can be a Kafka topic, a ClickHouse table, or a file.

Supported Connectors & Processors & Functions

Minimum Requirements

Git installed
JAVA（JDK/JRE11 are required）installed and JAVA_HOME set
Maven 3.5.4
Scala 2.12
Flink 1.13.1

Getting Started

Building

Run the following Maven command to build the project modules using parallel threads:

./mvnw clean install -T2C

Run the following Maven command to build the project modules and Skip Tests:

./mvnw clean install -DskipTests

Deploying

1.Download the release package

Download the latest release package from the Releases. Copy the groot-release/target/groot-stream-${version}-bin.tar.gz file to the target machine and extract it:

tar -zxvf groot-stream-${version}-bin.tar.gz
ls -lh groot-stream-${version}

2. Configure the environment

You need to configure Flink engine environment variables in config/grootstream-env.sh file.Default will use system environment variables. If not set, it will use the default value for the following variables:

FLINK_HOME=${FLINK_HOME:-/opt/flink}
FLINK_JOB_MANAGER_ADDRESS=${FLINK_JOB_MANAGER_ADDRESS:-localhost:8081}
YARN_ADDRESS=${YARN_ADDRESS:-yarn-cluster}

3. Configure the groot-stream job

You need to configure the groot-stream job in config/grootstream_job_example.yaml file. More information about config please check config concept

4. Submit a job to flink engine

Can be started by a daemon with -d.

  ./bin/start.sh -c *.yaml -d

Starting

Running job in your IDE

Set groot-bootstrap module pom.xml scope to compile.
Open the Run/Debug Configurations window.
Choose -cp groot-bootstrap
Choose Main Class com.geedgenetworks.bootstrap.main.GrootStreamServer.
Add VM options --target local -c /...../groot-stream/config/grootstream_job_example.yaml.
Click the Run button.

Running the CLI

Run the following command to start the groot-stream server for Standalone Mode:

cd "groot-stream-${version}"
./bin/start.sh -c ./config/grootstream_job_example.yaml --target remote -n inline-to-print-job -d

Run the following command to start the groot-stream server for Yarn Session Mode:

# First create a yarn session cluster 
yarn-session.sh -d  
# Then start the groot-stream server for Yarn Session Mode. 
cd "groot-stream-${version}"
./bin/start.sh -c ./config/grootstream_job_example.yaml --target yarn-session -Dyarn.application.id=application_XXXX_YY -n inline-to-print-job -d

Run the following command to start the groot-stream server for Yarn Per-job Mode:

cd "groot-stream-${version}"
./bin/start.sh -c ./config/grootstream_job_example.yaml --target yarn-per-job -Dyarn.application.name="inline-to-print-job" Djobmanager.memory.process.size=1024m -Dtaskmanager.memory.process.size=2048m -Dtaskmanager.numberOfTaskSlots=3  -p 6  -n inline-to-print-job -d

Configuring

The User Guide provides detailed information on how to configure a job.

Documentation

See the Groot Stream Documentation for more information.

Contributors

All developers see the list of contributors here.

README.md Unescape Escape