SQL Server How to merge multiple parquet files (more than 10+ assume )with different datatype to one parquet file using azure synapse?I had tried copy activity

5tmbdcev  于 2023-08-02  发布在  其他
关注(0)|答案(1)|浏览(91)

I am trying to merge multiple parquet files to single parquet file using Azure, since datatype of files are different and parquet files keeps the datatype schema in place, is it possible to, merge? Kindly provide the answer.

I have tried copy activity, and data flow, since I want to parameterize the different sub project to run, hence not possible using dataflow, as schema will change for different sub projects.

xzlaal3s

xzlaal3s1#

Achieving the above requirement using copy activity might not be possible because copy activity merging will skip some columns.

I am able to achieve it by iterating Data flows in a Foreach. Use Dataset parameters for this.

This is my source dataset with dataset parameter @dataset().filename .

In target, create empty file. Here I have created it manually. To do it dynamically, you can a use copy activity before the ForEach to copy an empty parquet file to the target location.

Now, use the above as two sources for the dataflow and don't import the projection for any of those.

Source1:

Source2:

Make sure you don't check the Infer drifted column types.

Now, use a Union transformation. this will merge the columns.

In sink, give the same target dataset which we used as one of the source and output the result to a single file in sink settings.

Here, I have taken the source files of different number of columns and different datatypes in same folder and used Get Meta data activity to get the list of files. Pass the file name to the Dataflow.

If your files are not in same folder, then first you need to get the list of file paths and pass that array to the ForEach. Use the dataset parameters according to the path of the dataset.

This method will take the empty target file first and unions the columns of source file and writes to the same target file in each iteration. At the end, this will union all of the source file columns and writes it to the target file.

Target file merged with different datatype columns:

NOTE: It converts data from each file to String type and the data types for every column in the above file will be a String type. For sample I took csv here, it will work same for .parquet as well.

If you want to generate lot of merging files for every folder, then you need to loop through above process for every folder.

相关问题