How do I pass a catalog name as a literal in PySpark?

jdzmm42g  posted on 2024-01-06  in Spark

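I am trying to build a programmatic way, using PySpark, to list all the databases in all the catalogs I have in Databricks. I can do this manually with SQL, but I want to use PySpark to make it more robust so that I can automate it.
Here is the code I am using: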

from pyspark.sql.functions import col, lit

list_catalogs = ['100sandbox', '200playground', '1000', 'sales']

df_catalogs_and_databases = None
_df = None

for catalog in list_catalogs:
    _df = spark.sql(f'SHOW DATABASES FROM {catalog}')\
        .select(
            lit(catalog).alias('catalog'),
            col('databaseName').alias('database')
        )
    try:
        df_catalogs_and_databases = df_catalogs_and_databases.union(_df)
    except AttributeError as e:
        # Catching this AttributeError: 'NoneType' object has no attribute 'union'
        df_catalogs_and_databases = _df
    except Exception as e:
        raise

display(df_catalogs_and_databases)

Running the code above produces the following error:
[PARSE_SYNTAX_ERROR] Syntax error at or near '100'. (line 1, pos 20)

Answer 1 (bvk5enib)

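  • The error occurs because the catalog name is interpolated into the SQL text as an unquoted identifier, and the parser reads a name such as `1000` as a number rather than as a name.
  • Unquoted identifiers that begin with a digit are not reliably accepted by the SQL parser. Either use a name that starts with a letter or an underscore, or wrap the name in backticks.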
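
The quoting option can be sketched as below. This is a minimal illustration, not part of the original answer: the helper names `quote_identifier` and `show_databases_sql` are hypothetical, and the sketch assumes that the SQL dialect delimits identifiers with backticks and escapes an embedded backtick by doubling it, as Spark SQL does.

```python
def quote_identifier(name: str) -> str:
    """Wrap a name in backticks, doubling any backtick inside it."""
    return "`" + name.replace("`", "``") + "`"


def show_databases_sql(catalog: str) -> str:
    """Build the SHOW DATABASES statement for a single catalog."""
    return f"SHOW DATABASES FROM {quote_identifier(catalog)}"


# Catalog names taken from the question, including the all-digit one.
list_catalogs = ['100sandbox', '200playground', '1000', 'sales']
statements = [show_databases_sql(c) for c in list_catalogs]

for statement in statements:
    print(statement)

# With a live session, each statement could then be passed to spark.sql().
```

Building the statement in a separate function keeps the quoting rule in one place, so the loop from the question only needs to call `spark.sql(show_databases_sql(catalog))`.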

I tried the following and got an error similar to yours.
I listed the catalogs like this:

list_catalogs = ['100sandbox', '200playground', '1000', 'sales']


Error:

ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near '1000'.(line 1, pos 20)

== SQL ==
SHOW DATABASES FROM 1000


I then renamed `1000` to `db_1000` and listed the databases as follows:

from pyspark.sql.functions import col, lit

list_catalogs = ['100sandbox', '200playground', 'db_1000', 'sales']
dilip_df = spark.sql("SHOW DATABASES") \
    .select(
        lit("all_catalogs").alias('catalog'),
        col('databaseName').alias('database')
    )
dilip_df.show(truncate=False)
+------------+-------------+
|catalog     |database     |
+------------+-------------+
|all_catalogs|100sandbox   |
|all_catalogs|200playground|
|all_catalogs|db_1000      |
|all_catalogs|default      |
|all_catalogs|sales        |
+------------+-------------+

I then tried the following approach:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType
list_catalogs = ['100sandbox', '200playground', 'db_1000', 'sales']
schema = StructType([
    StructField("catalog", StringType(), True),
    StructField("database", StringType(), True),
])
df_catalogs_and_databases = spark.createDataFrame([], schema)
for catalog in list_catalogs:
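    # Note: in PySpark 3.5+, the argument of listDatabases() is a database-name
    # pattern, not a catalog name, so this call returns the databases in the
    # current catalog whose name matches `catalog`.
    databases = spark.catalog.listDatabases(catalog)
    _df = spark.createDataFrame(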
        [(catalog, db.name) for db in databases],
        schema
    )
    df_catalogs_and_databases = df_catalogs_and_databases.unionAll(_df)
df_catalogs_and_databases.show(truncate=False)
+-------------+-------------+
|catalog      |database     |
+-------------+-------------+
|100sandbox   |100sandbox   |
|200playground|200playground|
|db_1000      |db_1000      |
|sales        |sales        |
+-------------+-------------+
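
To list the databases inside each catalog without building SQL text, one option is to switch the current catalog before calling `listDatabases()`. The following is a minimal sketch, assuming PySpark 3.4 or later (where `spark.catalog.currentCatalog()` and `spark.catalog.setCurrentCatalog()` are available). The helper name `list_databases_per_catalog` is hypothetical and not from the original answer, and the sketch has not been checked against a Databricks workspace.

```python
def list_databases_per_catalog(spark, catalogs):
    """Return (catalog, database) pairs by switching the current catalog.

    `spark` is assumed to be an active SparkSession. The catalog that was
    current before the call is restored afterwards, even on failure.
    """
    original = spark.catalog.currentCatalog()
    rows = []
    try:
        for catalog in catalogs:
            # Make this catalog the current one, then list its databases.
            spark.catalog.setCurrentCatalog(catalog)
            for db in spark.catalog.listDatabases():
                rows.append((catalog, db.name))
    finally:
        spark.catalog.setCurrentCatalog(original)
    return rows
```

The returned pairs could then be turned into a DataFrame with something like `spark.createDataFrame(rows, "catalog string, database string")`. Passing an explicit schema string avoids schema inference, which would fail if `rows` were empty.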
