How to register Python UDFs in Flink SQL, and how data flows when a Python-registered UDF is called

Published 2024-10-08 06:27 · Tags: flink, sql, python

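Background

This article is based on Flink 1.17.0.
Like the companion article "How to register Python UDFs in Spark SQL, and how data flows when a Python-registered UDF is called",
its goal is to explain how Flink SQL handles Python UDFs.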

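Analysis

Registering and calling a Python UDF

As described in the create-function documentation, a UDF can be registered with a SQL DDL statement. The following example is taken from StreamPythonUdfSqlJob.java: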

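        // Register the Python function add_one from module add_one as a temporary system function.
        tEnv.executeSql(
                "create temporary system function add_one as 'add_one.add_one' language python");

        // Build a one-column source table with the values 1, 2, 3.
        tEnv.createTemporaryView("source", tEnv.fromValues(1L, 2L, 3L).as("a"));

        // Call the Python UDF from SQL and collect the result rows.
        Iterator<Row> result = tEnv.executeSql("select add_one(a) as a from source").collect();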

where add_one.py is:

from pyflink.table import DataTypes
from pyflink.table.udf import udf


@udf(input_types=[DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
def add_one(i):
    import pytest
    return i + 1

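Here, too, the add_one function is defined in Python and then called from SQL.

Data flow when calling a Python UDF

Registration

The SQL DDL statement create temporary system function add_one as 'add_one.add_one' language python is eventually turned into a
CreateTempSystemFunctionOperation, which reaches the createSystemFunction((CreateTempSystemFunctionOperation) operation) call in TableEnvironmentImpl.executeInternal and ends up in registerTemporarySystemFunction: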

public void registerTemporarySystemFunction(
        String name, CatalogFunction function, boolean ignoreIfExists) {
    // The normalized name is used as the lookup key.
    final String normalizedName = FunctionIdentifier.normalizeName(name);

    try {
        validateAndPrepareFunction(name, function);
    } catch (Throwable t) {
        throw new ValidationException(
                String.format(
                        "Could not register temporary system function '%s' due to implementation errors.",
                        name),
                t);
    }
    if (!tempSystemFunctions.containsKey(normalizedName)) {
        tempSystemFunctions.put(normalizedName, function);
        // ... (the excerpt ends here; the rest of the method is omitted)

The function is ultimately stored in the FunctionCatalog.tempSystemFunctions map, which is consulted later whenever a function is looked up.
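To make the registration step concrete, here is a minimal Python sketch of a name-keyed registry. It is a toy model, not Flink code: the class and method names are hypothetical, lower-casing in normalize_name is an assumption about what name normalization does, and rejecting duplicates when ignore_if_exists is false is an assumption about the part of the method the excerpt does not show.

```python
from typing import Dict


class ValidationError(Exception):
    """Illustrative stand-in for Flink's ValidationException."""


class ToyFunctionCatalog:
    """Toy model of temporary system function registration (not Flink code)."""

    def __init__(self) -> None:
        # Plays the role of FunctionCatalog.tempSystemFunctions:
        # normalized function name -> fully qualified Python name.
        self.temp_system_functions: Dict[str, str] = {}

    @staticmethod
    def normalize_name(name: str) -> str:
        # Assumption: normalization means lower-casing the name.
        return name.lower()

    def register_temporary_system_function(
        self, name: str, function: str, ignore_if_exists: bool
    ) -> None:
        normalized = self.normalize_name(name)
        if normalized not in self.temp_system_functions:
            self.temp_system_functions[normalized] = function
        elif not ignore_if_exists:
            # Assumption: a duplicate is an error unless the caller opted out.
            raise ValidationError(f"function '{name}' already exists")


catalog = ToyFunctionCatalog()
catalog.register_temporary_system_function("add_one", "add_one.add_one", False)
# Same normalized name, but ignore_if_exists=True, so the first entry is kept.
catalog.register_temporary_system_function("ADD_ONE", "other.module", True)
print(catalog.temp_system_functions)  # {'add_one': 'add_one.add_one'}
```

Only the fully qualified name is stored at this point; nothing on the Python side is imported or started until the function is looked up.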

Invocation

In Flink, every function reference goes through the FunctionCatalog.lookupFunction method:

 public Optional<ContextResolvedFunction> lookupFunction(UnresolvedIdentifier identifier) {
        // precise function reference
        if (identifier.getDatabaseName().isPresent()) {
            return resolvePreciseFunctionReference(catalogManager.qualifyIdentifier(identifier));
        } else {
            // ambiguous function reference
            return resolveAmbiguousFunctionReference(identifier.getObjectName());
        }
    }
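The branch in lookupFunction can be pictured with a small Python sketch. This is a simplified model under stated assumptions: the two dictionaries, their contents, and the default catalog name are hypothetical, and the ambiguous path here consults only the temporary system functions, whereas the real resolution may check further sources.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory stores; keys and values are illustrative only.
temp_system_functions: Dict[str, str] = {"add_one": "add_one.add_one"}
catalog_functions: Dict[Tuple[str, str, str], str] = {
    ("default_catalog", "default_database", "my_func"): "my_module.my_func",
}


def lookup_function(
    object_name: str,
    database: Optional[str] = None,
    catalog: str = "default_catalog",
) -> Optional[str]:
    """Resolve a function reference the way the branch above suggests."""
    normalized = object_name.lower()
    if database is not None:
        # Precise reference: the identifier names a database, so the lookup
        # goes straight to the fully qualified (catalog, database, name) key.
        return catalog_functions.get((catalog, database, normalized))
    # Ambiguous reference: only the bare name is known, so the name-keyed
    # temporary system functions are consulted.
    return temp_system_functions.get(normalized)


print(lookup_function("ADD_ONE"))  # add_one.add_one
print(lookup_function("my_func", database="default_database"))  # my_module.my_func
```

A call such as add_one(a) in the SQL example carries no database name, so it takes the ambiguous path.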

The corresponding call chain is:

FunctionCatalog.resolveAmbiguousFunctionReference

getFunctionDefinition(normalizedName, tempSystemFunctions.get(normalizedName))

UserDefinedFunctionHelper.instantiateFunction

PythonFunctionUtils.getPythonFunction(catalogFunction.getClassName(), config, classLoader)

PythonFunctionUtils.pythonFunctionFactory (invokes getPythonFunction via reflection)

This eventually calls PythonFunctionFactory.getPythonFunction, which in turn calls the createPythonFunctionFactory method.

That method launches python -m pyflink.pyflink_callback_server. Everything started at this point is related to Py4J: the Py4J client used here is set up in the startGatewayServer method, and the python -m pyflink.pyflink_callback_server command puts the Python PythonFunctionFactory() object into the HashMap held by the Java-side gatewayServer.
The PythonFunctionFactory code is as follows:

import importlib


class PythonFunctionFactory(object):
    """
    Used to create PythonFunction objects for Java jobs.
    """

    def getPythonFunction(self, moduleName, objectName):
        # Import the module and fetch the @udf-decorated object by name.
        udf_wrapper = getattr(importlib.import_module(moduleName), objectName)
        return udf_wrapper._java_user_defined_function()

    class Java:
        # Tells Py4J which Java interface this Python object implements.
        implements = ["org.apache.flink.client.python.PythonFunctionFactory"]
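The import-and-getattr step in getPythonFunction can be tried without PyFlink. The sketch below is illustrative only: split_qualified_name and resolve are hypothetical helpers, splitting 'add_one.add_one' at the last dot is an assumption about how the module and object names are derived, and json.dumps stands in for a real UDF module because add_one.py is not on the import path here.

```python
import importlib
from typing import Any, Tuple


def split_qualified_name(full_name: str) -> Tuple[str, str]:
    """Split 'module.object' at the last dot (assumed convention)."""
    module_name, _, object_name = full_name.rpartition(".")
    if not module_name or not object_name:
        raise ValueError(f"expected 'module.object', got {full_name!r}")
    return module_name, object_name


def resolve(full_name: str) -> Any:
    """Import the module and fetch the named attribute from it."""
    module_name, object_name = split_qualified_name(full_name)
    return getattr(importlib.import_module(module_name), object_name)


print(split_qualified_name("add_one.add_one"))  # ('add_one', 'add_one')

# json.dumps is a stand-in for a UDF object living in some module.
dumps = resolve("json.dumps")
print(dumps([1, 2]))  # [1, 2]
```

In the real factory the resolved object is the @udf wrapper, and its _java_user_defined_function() result is what gets handed back to the Java side.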

So, in the createPythonFunctionFactory method:

pythonProcess =
        launchPy4jPythonClient(
                gatewayServer, config, commands, null, tmpDir, false);
entryPoint = (Map<String, Object>) gatewayServer.getGateway().getEntryPoint();
...
return new PythonFunctionFactoryImpl(
        (PythonFunctionFactory) entryPoint.get("PythonFunctionFactory"), shutdownHook);

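The returned PythonFunctionFactoryImpl wraps the Python PythonFunctionFactory object, so whatever the earlier PythonFunctionUtils.getPythonFunction call returns is a Java object wrapping a Python object, and all subsequent calls are inter-process calls made through Py4J.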
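The handoff through the entry-point map can be modelled in a few lines of Python. This is an in-process sketch only, with no Py4J and no second process: PythonSideFactory and FactoryWrapper are hypothetical names standing in for the Python PythonFunctionFactory and the Java PythonFunctionFactoryImpl, and operator.add stands in for a UDF.

```python
import importlib
from typing import Any, Dict


class PythonSideFactory:
    """Stand-in for the object the Python process publishes."""

    def get_python_function(self, module_name: str, object_name: str) -> Any:
        return getattr(importlib.import_module(module_name), object_name)


class FactoryWrapper:
    """Stand-in for the wrapper that holds the published object."""

    def __init__(self, delegate: PythonSideFactory) -> None:
        self._delegate = delegate

    def get_python_function(self, module_name: str, object_name: str) -> Any:
        # Every call is forwarded to the wrapped object. With Py4J this
        # forwarding would cross a process boundary.
        return self._delegate.get_python_function(module_name, object_name)


# Shared map playing the role of the gateway's entry point.
entry_point: Dict[str, Any] = {}

# "Python side": publish the factory under a well-known key.
entry_point["PythonFunctionFactory"] = PythonSideFactory()

# "Java side": read the factory back from the map and wrap it.
factory = FactoryWrapper(entry_point["PythonFunctionFactory"])
func = factory.get_python_function("operator", "add")
print(func(1, 1))  # 2
```

The well-known key is the only contract between the two sides, which is why the Java code above can cast entryPoint.get("PythonFunctionFactory") directly.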

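Summary

Flink SQL therefore still relies on Py4J to call Python UDFs. Because this is a form of inter-process communication, it cannot match the efficiency of UDFs written in Java or Scala. The approach is similar to the one described in "How to register Python UDFs in Spark SQL, and how data flows when a Python-registered UDF is called".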

Original article: https://blog.csdn.net/monkeyboy_tech/article/details/142741051
