Kafka: using a custom Flume interceptor to load a JSON file into a Kafka topic, then loading the data from the topic into HDFS

Published 2024-11-14 13:05 · Tags: kafka, flume, hdfs

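Load the data in trans_info.json into Kafka, filtering out and discarding records with tr_flag=0 so that only records in the normal state are kept. Place the JSON file on the cluster at /home/zidingyi/trans_info.json.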
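
Before wiring the rule into Flume, the filtering itself can be sketched in plain Java. This is a minimal illustration using only the JDK, not the interceptor itself: the sample records and the `id` field are hypothetical, the class and method names are made up for this sketch, and a regular expression stands in for a real JSON parser. It assumes `tr_flag` appears as a number or a quoted number, and it treats records without `tr_flag` as abnormal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrFlagFilter {
    // Matches "tr_flag": 0 as well as "tr_flag": "0" (assumed shapes of the field).
    private static final Pattern TR_FLAG =
            Pattern.compile("\"tr_flag\"\\s*:\\s*\"?(-?\\d{1,9})\"?");

    // Keep a record only if tr_flag is present and not 0.
    static boolean keep(String jsonLine) {
        Matcher m = TR_FLAG.matcher(jsonLine);
        return m.find() && Integer.parseInt(m.group(1)) != 0;
    }

    // Apply the rule to a batch and return only the surviving records.
    static List<String> filter(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (keep(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical sample records; the real file's fields are not shown in this article.
        List<String> sample = Arrays.asList(
                "{\"id\":1,\"tr_flag\":1}",
                "{\"id\":2,\"tr_flag\":0}",
                "{\"id\":3,\"tr_flag\":\"0\"}");
        for (String line : filter(sample)) {
            System.out.println(line);
        }
    }
}
```

With the sample above, only the record whose `tr_flag` is 1 survives the filter.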

First, write the custom interceptor in Java:

1): Create a Maven project and add the required dependencies to the pom file

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.flume/flume-ng-core -->
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.48</version>
    </dependency>
</dependencies>

<!-- A Maven packaging plugin can bundle not only our own code but also the JARs it depends on -->

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.1.1</version>
            <configuration>
                <!-- Do not generate dependency-reduced-pom.xml -->
                <createDependencyReducedPom>false</createDependencyReducedPom>
            </configuration>
            <executions>
                <!-- Run shade goal on package phase -->
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <relocations>
                            <relocation>
                                <!-- Relocate the package to resolve a package conflict -->
                                <pattern>com.google.protobuf</pattern>
                                <shadedPattern>shaded.com.google.protobuf</shadedPattern>
                            </relocation>
                        </relocations>
                        <artifactSet>
                            <excludes>
                                <exclude>log4j:*</exclude>
                            </excludes>
                        </artifactSet>
                        <filters>
                            <filter>
                                <!-- Do not copy the signatures in the META-INF folder.
                                Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
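                            <!-- Some JARs contain other resources with the same file name (for example, properties files). To avoid overwriting, their contents can be merged by appending them into a single file -->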
                            <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                <resource>reference.conf</resource>
                            </transformer>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>mainclass</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Custom interceptor code:

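package com.bigdata;

import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class zidingyi implements Interceptor {

    @Override
    public void initialize() {
        // No resources to set up.
    }

    // Returns the event unchanged, or null to drop it.
    @Override
    public Event intercept(Event event) {
        // Decode explicitly as UTF-8 instead of relying on the platform default charset.
        String body = new String(event.getBody(), StandardCharsets.UTF_8);

        JSONObject jsonObject = JSONObject.parseObject(body);

        // Read tr_flag as an Integer so that a missing field does not throw a NullPointerException on unboxing.
        Integer trFlag = (jsonObject == null) ? null : jsonObject.getInteger("tr_flag");

        // Drop events whose tr_flag is 0; events without tr_flag are dropped as well (a choice made here).
        if (trFlag == null || trFlag == 0) {
            return null;
        }

        return event;
    }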

    @Override
    public List<Event> intercept(List<Event> list) {

        ArrayList<Event> filterEvents = new ArrayList<>();
        for (Event event : list) {

            Event intercept = intercept(event);

            if (intercept != null){

                filterEvents.add(intercept);

            }
        }
        return filterEvents;

    }

    @Override
    public void close() {

    }
    public static class BuilderEvent implements Builder{

        @Override
        public Interceptor build() {
            return new zidingyi();
        }

        @Override
        public void configure(Context context) {

        }

    }

}

Package the project with Maven, then upload the generated JAR to the lib directory under the Flume installation.

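2): After uploading the JAR, create a myconf directory under Flume's conf directory, create a zidingyi.conf file in it, and write the Flume configuration there (remember to use the custom interceptor)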

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /home/zidingyi/trans_info.json

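# Attach the custom interceptor to source r1 (the name must match the source declared above)
a1.sources.r1.interceptors = i1
# type is the fully qualified name of the interceptor's Builder class; a nested class is joined to its outer class with $
a1.sources.r1.interceptors.i1.type = com.bigdata.zidingyi$BuilderEvent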

# Use Kafka as the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = bigdata01:9092
a1.sinks.k1.kafka.topic = zidingyi
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
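
The `$` in the interceptor `type` value is how Java names a nested class at runtime. The following is a small stand-alone sketch of that naming rule, using only the JDK; `BinaryNameDemo` and `Outer` are hypothetical stand-ins for the real interceptor class, and the sketch makes no claim about how Flume itself performs the lookup.

```java
public class BinaryNameDemo {
    // Hypothetical stand-in for an interceptor class with a nested builder.
    static class Outer {
        static class BuilderEvent {
        }
    }

    // The runtime (binary) name joins a nested class to its outer class with '$'.
    static String builderTypeName() {
        return Outer.BuilderEvent.class.getName();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        String name = builderTypeName();
        System.out.println(name);

        // A reflective lookup needs the '$' form; the dotted source-style name would not resolve.
        Class<?> loaded = Class.forName(name);
        System.out.println(loaded == Outer.BuilderEvent.class);
    }
}
```

By the same rule, the nested `BuilderEvent` class inside `com.bigdata.zidingyi` is written as `com.bigdata.zidingyi$BuilderEvent` in the configuration.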

Before running the agent, create a topic named zidingyi in Kafka; the data will be written to this topic.

Once the topic exists, start the agent with the conf file:

flume-ng agent -c ./ -f zidingyi.conf -n a1 -Dflume.root.logger=INFO,console

You can use

kafka-console-consumer.sh --bootstrap-server bigdata01:9092 --from-beginning --topic zidingyi

to read all the data in the topic (including historical data) and keep receiving new data from producers.

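3): Load the data from the topic into HDFS

The group.id in the configuration can be set to any value.

Start the agent with this conf file (zidingyi2.conf, listed after the command):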

flume-ng agent -c ./ -f zidingyi2.conf -n a1 -Dflume.root.logger=INFO,console

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 100
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = bigdata01:9092,bigdata02:9092,bigdata03:9092
a1.sources.r1.kafka.topics = zidingyi
a1.sources.r1.kafka.consumer.group.id = donghu

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /zidingyi/ods/clearDate/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp=true

a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 102400
a1.sinks.k1.hdfs.rollInterval = 0

a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
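
With `hdfs.round = true`, `hdfs.roundValue = 10` and `hdfs.roundUnit = minute`, the timestamp used in `hdfs.path` is rounded down to a 10-minute boundary, so each directory collects 10 minutes of events. The sketch below reproduces that arithmetic with `java.time` to show which directory a given event time would land in. It is an illustration under assumptions, not Flume's own code: `HdfsBucketPath` and `bucketPath` are hypothetical names, and the rounding interval is assumed to divide 60 evenly.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class HdfsBucketPath {
    // yy-MM-dd/HHmm mirrors the %y-%m-%d/%H%M escapes used in hdfs.path.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yy-MM-dd/HHmm");

    // Round the event time down to a multiple of roundMinutes, then format the directory.
    static String bucketPath(LocalDateTime eventTime, int roundMinutes) {
        LocalDateTime truncated = eventTime.truncatedTo(ChronoUnit.MINUTES);
        LocalDateTime rounded = truncated.minusMinutes(truncated.getMinute() % roundMinutes);
        return "/zidingyi/ods/clearDate/" + rounded.format(FORMAT) + "/";
    }

    public static void main(String[] args) {
        // An event at 13:05:30 falls into the 13:00 bucket.
        System.out.println(bucketPath(LocalDateTime.of(2024, 11, 14, 13, 5, 30), 10));
    }
}
```

Because `rollCount` and `rollInterval` are both 0 in the configuration above, files inside each directory roll only when they reach `rollSize` (102400 bytes).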

The data is loaded into HDFS successfully.

Original article: https://blog.csdn.net/qq_62984376/article/details/143759037
