Data saved to an Ignite cluster from Apache Spark is distributed unevenly

Posted by x33g5p2x on 2022-12-19 in Apache

I'm working on a use case that stores data in an in-memory database and makes it available for BI analytics.
My main goals are:
1. Cache the data in Ignite and spill it to disk if it exceeds the available heap size.
2. Upsert CDC changes into the existing tables.
3. Expose the Ignite tables over JDBC so BI can query them.
4. The BI analytics should refresh within 2 to 3 seconds.
As part of this exercise, I tried to load the historical data (about 700 million rows, roughly 87 GiB) into Ignite using Spark.
I was able to integrate Spark with Ignite and successfully save a DataFrame into an Ignite table. After loading 8 million records, I noticed that the table partitions are not distributed evenly across the cluster. Also, the data size on disk is larger than the source data: the roughly 8 million records take about 20 GiB on disk. I have tried every configuration setting I could find, but did not manage to distribute the data evenly across the cluster or to compact the data files on disk. Can anyone help me with the configuration? Am I missing any Ignite or Spark settings needed to distribute the data evenly? Also, how can I check how much data is cached?
Let me know if I need to provide more information.
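
The Spark job itself is not shown in the post; below is a minimal sketch of this kind of DataFrame write, assuming a placeholder object name and Parquet source path (not the actual job), with the option constants taken from org.apache.ignite.spark.IgniteDataFrameSettings in the ignite-spark module.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.ignite.spark.IgniteDataFrameSettings._

object LoadTicketsToIgnite {
  def main(args: Array[String]): Unit = {
    // Client-mode Ignite config mounted from the ConfigMap (see spark.yaml below).
    val igniteConfig = "/ignite/config/ignite-config.xml"

    val spark = SparkSession.builder().appName("spark-ignite-load").getOrCreate()

    // Placeholder source: historical data read from Parquet (path is hypothetical).
    val df = spark.read.parquet("/path/to/historical/tickets")

    df.write
      .format(FORMAT_IGNITE)                         // the "ignite" Spark data source
      .option(OPTION_CONFIG_FILE, igniteConfig)      // Spring XML with clientMode=true
      .option(OPTION_TABLE, "edw_dds_ticket")        // existing SQL table created via DDL
      .option(OPTION_STREAMER_ALLOW_OVERWRITE, true) // overwrite existing keys (upsert-style)
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}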
Ignite table

CREATE TABLE edw_dds_ticket (
    ..
    ..
    PRIMARY KEY (helix_uuid, ticket_issue_date)
) WITH "TEMPLATE=PARTITIONED,backups=1,affinity_key=ticket_issue_date";

node-configuration.xml

<?xml version="1.0" encoding="UTF-8"?>
    <beans xmlns="http://www.springframework.org/schema/beans"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd">
        <bean class="org.apache.ignite.configuration.IgniteConfiguration">
            <property name="workDirectory" value="/mnt/ignite/work"/>
            <property name="dataStorageConfiguration">
                <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                    <property name="defaultDataRegionConfiguration">
                        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                            <property name="checkpointPageBufferSize" value="#{2048L * 1024 * 1024}"/>
                            <property name="persistenceEnabled" value="true"/>
                            <!-- Custom region name. -->
                            <property name="name" value="500MB_Region"/>
                            <!-- 100 MB initial size. -->
                            <property name="initialSize" value="#{100L * 1024 * 1024}"/>
                            <!-- 500 MB maximum size. -->
                            <property name="maxSize" value="#{500L * 1024 * 1024}"/>
                        </bean>
                    </property>
                    <property name="writeThrottlingEnabled" value="true"/>
                    <property name="storagePath" value="/mnt/ignite/data"/>
                    <property name="walPath" value="/mnt/ignite/wal"/>
                    <!-- Disable the WAL archive by setting the same path as the WAL. -->
                    <property name="walArchivePath" value="/mnt/ignite/wal"/>
                    <!--<property name="walArchivePath" value="/mnt/ignite/walarchive"/>-->
                    <property name="walSegmentSize" value="#{256 * 1024 * 1024}"/>
                    <property name="walCompactionEnabled" value="true"/>
                    <property name="pageSize" value="#{8 * 1024}"/>
                </bean>
            </property>
            <property name="discoverySpi">
                <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                    <property name="ipFinder">
                        <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder">
                            <constructor-arg>
                                <bean class="org.apache.ignite.kubernetes.configuration.KubernetesConnectionConfiguration">
                                    <property name="namespace" value="ignite" />
                                    <property name="serviceName" value="ignite-service" />
                                </bean>
                            </constructor-arg>
                        </bean>
                    </property>
                </bean>
            </property>
        </bean>
    </beans>

statefulset.yaml

# An example of a Kubernetes configuration for pod deployment.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  # Cluster name.
  name: ignite-cluster
  namespace: ignite
spec:
  # The initial number of pods to be started by Kubernetes.
  replicas: 6
  serviceName: ignite
  selector:
    matchLabels:
      app: ignite
  template:
    metadata:
      labels:
        app: ignite
    spec:
      serviceAccountName: ignite
      terminationGracePeriodSeconds: 60000
      containers:
        # Custom pod name.
      - name: ignite-node
        image: apacheignite/ignite:2.13.0
        resources:
          requests:
            memory: "40Gi"
            cpu: "1"
          limits:
            memory: "40Gi"
            cpu: "4"
        env:
        - name: OPTION_LIBS
          value: ignite-kubernetes,ignite-rest-http,ignite-compress,ignite-spark-2.4,ignite-spring,ignite-indexing,ignite-log4j2,ignite-slf4j
        - name: CONFIG_URI
          value: file:///mnt/ignite/config/node-configuration.xml
        - name: JVM_OPTS
          value: "-server -Xms30g -Xmx30g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=2G -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true -Djava.net.preferIPv4Stack=true "
        - name: CONTROL_JVM_OPTS
          value: "-server -Djava.net.preferIPv4Stack=true -Xms30g -Xmx30g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=2G  -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true"
        ports:
        # Ports to open.
        - containerPort: 47100 # communication SPI port
        - containerPort: 47500 # discovery SPI port
        - containerPort: 49112 # JMX port
        - containerPort: 10800 # thin clients/JDBC driver port
        - containerPort: 8080 # REST API
        volumeMounts:
        - mountPath: /mnt/ignite/config
          name: config-vol
        - mountPath: /mnt/ignite/data
          name: data-vol
        - mountPath: /mnt/ignite/wal
          name: wal-vol
        - mountPath: /mnt/ignite/work
          name: work-vol
      securityContext:
        fsGroup: 2000 # try removing this if you have permission issues
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: agentpool
                operator: In
                values:
                - userpool1
      volumes:
      - name: config-vol
        configMap:
          name: ignite-configmap-with-persistence
  volumeClaimTemplates:
  - metadata:
      name: data-vol
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "managed-csi-premium"
      resources:
        requests:
          storage: "100Gi" 
  - metadata:
      name: work-vol
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "managed-csi-premium"
      resources:
        requests:
          storage: "10Gi" # make sure to provide enought space for your application data
  - metadata:
      name: wal-vol
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "managed-csi-premium"
      resources:
        requests:
          storage: "5Gi"
#  - metadata:
#      name: walarchive-vol
#    spec:
#      accessModes: [ "ReadWriteOnce" ]
#      storageClassName: "managed-csi-premium"
#      resources:
#        requests:
#          storage: "5Gi"

Spark client connection configuration: spark-ignite-configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: ignite-configmap
  namespace: spark
data:
  ignite-config.xml: |
    <?xml version="1.0" encoding="UTF-8"?>
    <beans xmlns="http://www.springframework.org/schema/beans"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.springframework.org/schema/beans
            http://www.springframework.org/schema/beans/spring-beans.xsd">
        <!-- Imports default Ignite configuration -->
        <bean class="org.apache.ignite.configuration.IgniteConfiguration">
            <!--<property name="peerClassLoadingEnabled" value="true"/> -->
            <property name="clientMode" value="true"/>
            <property name="discoverySpi">
                <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                    <property name="ipFinder">
                        <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder">
                            <constructor-arg>
                                <bean class="org.apache.ignite.kubernetes.configuration.KubernetesConnectionConfiguration">
                                    <property name="namespace" value="ignite" />
                                    <property name="serviceName" value="ignite-service" />
                                </bean>
                            </constructor-arg>
                        </bean>
                    </property>
                </bean>
            </property>
        </bean>
    </beans>

spark.yaml

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-ignite
  namespace: spark
  labels:
    app: spark
spec:
  type: Scala
  mode: cluster
  image: "spark:v2.4.7_ignite"
  imagePullSecrets:
    - image-pull-secret
  imagePullPolicy: Always 
  mainClass: sparkentryclass
  arguments:
    - "2017-01-01"
    - "/ignite/config/ignite-config.xml"
  mainApplicationFile: "local:///opt/spark/examples/jars/IgnieDataFrame-1.0-SNAPSHOT-uber.jar"
  sparkVersion: "2.4.7"
  volumes:
    - name: config-vol
      configMap:
        name: ignite-configmap
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "10g"
    labels:
      version: 2.4.7
    serviceAccount: spark
    volumeMounts:
      - name: config-vol
        mountPath: /ignite/config
  executor:
    cores: 3 
    instances: 5
    memory: "10g"
    labels:
      version: 2.4.7
    volumeMounts:
      - name: config-vol
        mountPath: /ignite/config

Spark log: it clearly shows 6 servers (Ignite) and 6 clients (the Spark executors).

22/06/20 11:38:53 INFO TaskSetManager: Finished task 92.0 in stage 2.0 (TID 94) in 337817 ms on 192.168.14.14 (executor 3) (96/124)
22/06/20 11:39:13 INFO IgniteKernal:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=ec62778c, uptime=00:19:00.103]
    ^-- Cluster [hosts=12, CPUs=55, servers=6, clients=6, topVer=14, minorTopVer=0]
    ^-- Network [addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.14.10], discoPort=0, commPort=47100]
    ^-- CPU [CPUs=16, curLoad=0.07%, avgLoad=0.12%, GC=0%]
    ^-- Heap [used=359MB, free=96.05%, comm=790MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=1, qSize=0]
    ^-- Striped thread pool [active=0, idle=16, qSize=0]
22/06/20 11:39:56 INFO TaskSetManager: Starting task 111.0 in stage 2.0 (TID 113, 192.168.14.12, executor 2, partition 111, PROCESS_LOCAL, 38281 bytes)
22/06/20 11:39:56 INFO TaskSetManager: Finished task 104.0 in stage 2.0 (TID 106) in 287029 ms on 192.168.14.12 (executor 2) (97/124)
22/06/20 11:40:01 INFO TaskSetManager: Starting task 112.0 in stage 2.0 (TID 114, 192.168.14.14, executor 3, partition 112, PROCESS_LOCAL, 35435 bytes)
22/06/20 11:40:01 INFO TaskSetManager: Finished task 94.0 in stage 2.0 (TID 96) in 354149 ms on 192.168.14.14 (executor 3) (98/124)
22/06/20 11:40:13 INFO IgniteKernal:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=ec62778c, uptime=00:20:00.110]
    ^-- Cluster [hosts=12, CPUs=55, servers=6, clients=6, topVer=14, minorTopVer=0]
    ^-- Network [addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.14.10], discoPort=0, commPort=47100]
    ^-- CPU [CPUs=16, curLoad=0.03%, avgLoad=0.12%, GC=0%]
    ^-- Heap [used=369MB, free=95.94%, comm=790MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=1, qSize=0]
    ^-- Striped thread pool [active=0, idle=16, qSize=0]

Ignite cluster pod storage details

Pod-5: k exec -it ignite-cluster-5 -- du -h /mnt/ignite/data
16.0K   /mnt/ignite/data/lost+found
340.0K  /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493/metastorage
104.0K  /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493/cache-SQL_PUBLIC_EDW_DDS_TICKET
88.0K   /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493/cp
4.0K    /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493/snp
4.0K    /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493/TxLog
60.0K   /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493/cache-ignite-sys-cache
608.0K  /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493
632.0K  /mnt/ignite/data
 
Pod-4: k exec -it ignite-cluster-4 -- du -h /mnt/ignite/data
16.0K   /mnt/ignite/data/lost+found
60.0K   /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1/cache-ignite-sys-cache
4.0K    /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1/snp
104.0K  /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1/cache-SQL_PUBLIC_EDW_DDS_TICKET
324.0K  /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1/metastorage
4.0K    /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1/TxLog
72.0K   /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1/cp
576.0K  /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1
600.0K  /mnt/ignite/data

Pod-3: k exec -it ignite-cluster-3 -- du -h /mnt/ignite/data
16.0K   /mnt/ignite/data/lost+found
24.0K   /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5/cp
316.0K  /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5/metastorage
4.0K    /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5/TxLog
18.4G   /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5/cache-SQL_PUBLIC_EDW_DDS_TICKET
60.0K   /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5/cache-ignite-sys-cache
4.0K    /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5/snp
18.4G   /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5
18.4G   /mnt/ignite/data

Pod-2: k exec -it ignite-cluster-2 -- du -h /mnt/ignite/data
16.0K   /mnt/ignite/data/lost+found
4.0K    /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4/snp
308.0K  /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4/metastorage
24.0K   /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4/cp
20.7G   /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4/cache-SQL_PUBLIC_EDW_DDS_TICKET
4.0K    /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4/TxLog
60.0K   /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4/cache-ignite-sys-cache
20.7G   /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4
20.7G   /mnt/ignite/data

Pod-1: k exec -it ignite-cluster-1 -- du -h /mnt/ignite/data
60.0K   /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508/cache-ignite-sys-cache
4.0K    /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508/TxLog
308.0K  /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508/metastorage
4.0K    /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508/snp
28.0K   /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508/cp
2.1G    /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508/cache-SQL_PUBLIC_EDW_DDS_TICKET
2.1G    /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508
16.0K   /mnt/ignite/data/lost+found
2.1G    /mnt/ignite/data

Pod-0: k exec -it ignite-cluster-0 -- du -h /mnt/ignite/data
4.0K    /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8/TxLog
324.0K  /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8/metastorage
60.0K   /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8/cache-ignite-sys-cache
4.0K    /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8/snp
104.0K  /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8/cache-SQL_PUBLIC_EDW_DDS_TICKET
88.0K   /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8/cp
592.0K  /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8
16.0K   /mnt/ignite/data/lost+found
616.0K  /mnt/ignite/data
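
Regarding the question above of how to check how much data is actually cached: the row count can be verified over the thin-client JDBC port (10800) exposed by the StatefulSet. A minimal sketch, assuming the Kubernetes service is reachable as ignite-service.ignite.svc.cluster.local (the DNS name is an assumption):

import java.sql.DriverManager

object CheckIgniteRowCount {
  def main(args: Array[String]): Unit = {
    // Assumed endpoint: the "ignite-service" Service in the "ignite" namespace, thin-client port 10800.
    Class.forName("org.apache.ignite.IgniteJdbcThinDriver")
    val conn = DriverManager.getConnection("jdbc:ignite:thin://ignite-service.ignite.svc.cluster.local:10800")
    try {
      // Count the rows currently stored in the SQL table.
      val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM edw_dds_ticket")
      if (rs.next()) println(s"Rows in edw_dds_ticket: ${rs.getLong(1)}")
    } finally {
      conn.close()
    }
  }
}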
Answer 1 (vm0i2vca):

Using a date as the affinity key tends to be a poor choice, for exactly the reason you've discovered. From the information you've provided it's impossible to say what the "right" affinity key would be (data modelling is hard), but I can say that if you don't specify one at all, you'll get a better distribution.
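
As an illustration of that suggestion only (a sketch, not a definitive fix): the same DDL without the affinity_key clause, columns elided as in the question, lets Ignite derive affinity from the whole primary key, so the high-cardinality helix_uuid contributes to the distribution.

CREATE TABLE edw_dds_ticket (
    ..
    ..
    PRIMARY KEY (helix_uuid, ticket_issue_date)
) WITH "TEMPLATE=PARTITIONED,backups=1";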
