I'm working on a use case where data is stored in an in-memory database and made available for BI analytics.
My main goals are:
1. Cache the data in Ignite, spilling to disk if the data exceeds the available heap size.
2. Upsert CDC changes into the existing tables.
3. Expose the Ignite tables to the BI tool over JDBC for analysis (see the connectivity sketch after this list).
4. The BI analytics should refresh within 2 to 3 seconds.
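For goal 3, a minimal connectivity sketch against Ignite's thin JDBC endpoint (port 10800, exposed by the StatefulSet below). The in-cluster DNS name used here is an assumption; substitute whatever address the BI tool can actually reach:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class IgniteJdbcSmokeTest {
    public static void main(String[] args) throws Exception {
        // Ignite thin JDBC driver URL; the hostname is an assumed in-cluster DNS name
        // for the ignite-service Service in the ignite namespace, port 10800 from the StatefulSet.
        String url = "jdbc:ignite:thin://ignite-service.ignite.svc.cluster.local:10800";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM edw_dds_ticket")) {
            if (rs.next())
                System.out.println("edw_dds_ticket rows: " + rs.getLong(1));
        }
    }
}
```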
As part of this exercise, I tried loading the historical data (~700 million rows, roughly 87 GiB) into Ignite using Spark.
I was able to integrate Spark with Ignite and successfully save a DataFrame into an Ignite table. After loading 8 million records, however, I noticed that the table partitions are distributed very unevenly across the cluster. The data size on disk is also much larger than the source data: those ~8 million records occupy about 20 GiB on disk. I have tried every configuration setting I could find, but have not managed to distribute the data evenly across the cluster or to compress the data files on disk. Can anyone help me with the configuration? Am I missing any Ignite or Spark settings for distributing the data evenly? Also, how can I check how much data is cached?
Please let me know if you need any more information.
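Regarding the last question (how much data is cached), this is a sketch of how I understand it could be checked via the Java API from a client node — unverified; the cache name matches the on-disk directory SQL_PUBLIC_EDW_DDS_TICKET shown further down, and the config path is the one mounted for the Spark job:

```java
import org.apache.ignite.DataRegionMetrics;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CachePeekMode;

public class CacheSizeCheck {
    public static void main(String[] args) {
        // Start a client node with the same client XML used by the Spark job (path is an assumption).
        try (Ignite ignite = Ignition.start("/ignite/config/ignite-config.xml")) {
            IgniteCache<?, ?> cache = ignite.cache("SQL_PUBLIC_EDW_DDS_TICKET");

            // Total number of primary entries across the cluster.
            System.out.println("Primary entries: " + cache.sizeLong(CachePeekMode.PRIMARY));

            // Per-partition primary entry counts -- useful for spotting skew.
            int parts = ignite.affinity("SQL_PUBLIC_EDW_DDS_TICKET").partitions();
            for (int p = 0; p < parts; p++) {
                long size = cache.sizeLong(p, CachePeekMode.PRIMARY);
                if (size > 0)
                    System.out.println("partition " + p + ": " + size + " entries");
            }

            // Data-region metrics are per-node; on a client they cover only the client's own
            // regions, so byte-level usage would need to be read on the servers (e.g. via JMX).
            for (DataRegionMetrics m : ignite.dataRegionMetrics())
                System.out.printf("region=%s allocated=%d bytes%n", m.getName(), m.getTotalAllocatedSize());
        }
    }
}
```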
Ignite table
CREATE TABLE edw_dds_ticket (
..
..
PRIMARY KEY (helix_uuid, ticket_issue_date)
) WITH "TEMPLATE=PARTITIONED,backups=1,affinity_key=ticket_issue_date";
node-configuration.xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd">
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="workDirectory" value="/mnt/ignite/work"/>
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="checkpointPageBufferSize" value="#{2048L * 1024 * 1024}"/>
<property name="persistenceEnabled" value="true"/>
<!-- Custom region name. -->
<property name="name" value="500MB_Region"/>
<!-- 100 MB initial size. -->
<property name="initialSize" value="#{100L * 1024 * 1024}"/>
<!-- 500 MB maximum size. -->
<property name="maxSize" value="#{500L * 1024 * 1024}"/>
</bean>
</property>
<property name="writeThrottlingEnabled" value="true"/>
<property name="storagePath" value="/mnt/ignite/data"/>
<property name="walPath" value="/mnt/ignite/wal"/>
<!-- Disable the WAL archive by setting it to the same path as the WAL. -->
<property name="walArchivePath" value="/mnt/ignite/wal"/>
<!--<property name="walArchivePath" value="/mnt/ignite/walarchive"/>-->
<property name="walSegmentSize" value="#{256 * 1024 * 1024}"/>
<property name="walCompactionEnabled" value="true"/>
<property name="pageSize" value="#{8 * 1024}"/>
</bean>
</property>
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="ipFinder">
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder">
<constructor-arg>
<bean class="org.apache.ignite.kubernetes.configuration.KubernetesConnectionConfiguration">
<property name="namespace" value="ignite" />
<property name="serviceName" value="ignite-service" />
</bean>
</constructor-arg>
</bean>
</property>
</bean>
</property>
</bean>
</beans>
statefulset.yaml
# An example of a Kubernetes configuration for pod deployment.
apiVersion: apps/v1
kind: StatefulSet
metadata:
# Cluster name.
name: ignite-cluster
namespace: ignite
spec:
# The initial number of pods to be started by Kubernetes.
replicas: 6
serviceName: ignite
selector:
matchLabels:
app: ignite
template:
metadata:
labels:
app: ignite
spec:
serviceAccountName: ignite
terminationGracePeriodSeconds: 60000
containers:
# Custom pod name.
- name: ignite-node
image: apacheignite/ignite:2.13.0
resources:
requests:
memory: "40Gi"
cpu: "1"
limits:
memory: "40Gi"
cpu: "4"
env:
- name: OPTION_LIBS
value: ignite-kubernetes,ignite-rest-http,ignite-compress,ignite-spark-2.4,ignite-spring,ignite-indexing,ignite-log4j2,ignite-slf4j
- name: CONFIG_URI
value: file:///mnt/ignite/config/node-configuration.xml
- name: JVM_OPTS
value: "-server -Xms30g -Xmx30g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=2G -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true -Djava.net.preferIPv4Stack=true "
- name: CONTROL_JVM_OPTS
value: "-server -Djava.net.preferIPv4Stack=true -Xms30g -Xmx30g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=2G -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true"
ports:
# Ports to open.
- containerPort: 47100 # communication SPI port
- containerPort: 47500 # discovery SPI port
- containerPort: 49112 # JMX port
- containerPort: 10800 # thin clients/JDBC driver port
- containerPort: 8080 # REST API
volumeMounts:
- mountPath: /mnt/ignite/config
name: config-vol
- mountPath: /mnt/ignite/data
name: data-vol
- mountPath: /mnt/ignite/wal
name: wal-vol
- mountPath: /mnt/ignite/work
name: work-vol
securityContext:
fsGroup: 2000 # try removing this if you have permission issues
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: agentpool
operator: In
values:
- userpool1
volumes:
- name: config-vol
configMap:
name: ignite-configmap-with-persistence
volumeClaimTemplates:
- metadata:
name: data-vol
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "managed-csi-premium"
resources:
requests:
storage: "100Gi"
- metadata:
name: work-vol
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "managed-csi-premium"
resources:
requests:
storage: "10Gi" # make sure to provide enought space for your application data
- metadata:
name: wal-vol
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "managed-csi-premium"
resources:
requests:
storage: "5Gi"
# - metadata:
# name: walarchive-vol
# spec:
# accessModes: [ "ReadWriteOnce" ]
# storageClassName: "managed-csi-premium"
# resources:
# requests:
# storage: "5Gi"
Spark client connection configuration: spark-ignite-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: ignite-configmap
namespace: spark
data:
ignite-config.xml: |
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd">
<!-- Imports default Ignite configuration -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
<!--<property name="peerClassLoadingEnabled" value="true"/> -->
<property name="clientMode" value="true"/>
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="ipFinder">
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder">
<constructor-arg>
<bean class="org.apache.ignite.kubernetes.configuration.KubernetesConnectionConfiguration">
<property name="namespace" value="ignite" />
<property name="serviceName" value="ignite-service" />
</bean>
</constructor-arg>
</bean>
</property>
</bean>
</property>
</bean>
</beans>
spark.yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-ignite
namespace: spark
labels:
app: spark
spec:
type: Scala
mode: cluster
image: "spark:v2.4.7_ignite"
imagePullSecrets:
- image-pull-secret
imagePullPolicy: Always
mainClass: sparkentryclass
arguments:
- "2017-01-01"
- "/ignite/config/ignite-config.xml"
mainApplicationFile: "local:///opt/spark/examples/jars/IgnieDataFrame-1.0-SNAPSHOT-uber.jar"
sparkVersion: "2.4.7"
volumes:
- name: config-vol
configMap:
name: ignite-configmap
restartPolicy:
type: Never
driver:
cores: 1
memory: "10g"
labels:
version: 2.4.7
serviceAccount: spark
volumeMounts:
- name: config-vol
mountPath: /ignite/config
executor:
cores: 3
instances: 5
memory: "10g"
labels:
version: 2.4.7
volumeMounts:
- name: config-vol
mountPath: /ignite/config
Spark log: it clearly shows 6 servers (Ignite) and 6 clients (the Spark executors).
22/06/20 11:38:53 INFO TaskSetManager: Finished task 92.0 in stage 2.0 (TID 94) in 337817 ms on 192.168.14.14 (executor 3) (96/124)
22/06/20 11:39:13 INFO IgniteKernal:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=ec62778c, uptime=00:19:00.103]
^-- Cluster [hosts=12, CPUs=55, servers=6, clients=6, topVer=14, minorTopVer=0]
^-- Network [addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.14.10], discoPort=0, commPort=47100]
^-- CPU [CPUs=16, curLoad=0.07%, avgLoad=0.12%, GC=0%]
^-- Heap [used=359MB, free=96.05%, comm=790MB]
^-- Outbound messages queue [size=0]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=1, qSize=0]
^-- Striped thread pool [active=0, idle=16, qSize=0]
22/06/20 11:39:56 INFO TaskSetManager: Starting task 111.0 in stage 2.0 (TID 113, 192.168.14.12, executor 2, partition 111, PROCESS_LOCAL, 38281 bytes)
22/06/20 11:39:56 INFO TaskSetManager: Finished task 104.0 in stage 2.0 (TID 106) in 287029 ms on 192.168.14.12 (executor 2) (97/124)
22/06/20 11:40:01 INFO TaskSetManager: Starting task 112.0 in stage 2.0 (TID 114, 192.168.14.14, executor 3, partition 112, PROCESS_LOCAL, 35435 bytes)
22/06/20 11:40:01 INFO TaskSetManager: Finished task 94.0 in stage 2.0 (TID 96) in 354149 ms on 192.168.14.14 (executor 3) (98/124)
22/06/20 11:40:13 INFO IgniteKernal:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=ec62778c, uptime=00:20:00.110]
^-- Cluster [hosts=12, CPUs=55, servers=6, clients=6, topVer=14, minorTopVer=0]
^-- Network [addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.14.10], discoPort=0, commPort=47100]
^-- CPU [CPUs=16, curLoad=0.03%, avgLoad=0.12%, GC=0%]
^-- Heap [used=369MB, free=95.94%, comm=790MB]
^-- Outbound messages queue [size=0]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=1, qSize=0]
^-- Striped thread pool [active=0, idle=16, qSize=0]
Ignite cluster pod storage details
Pod-5: k exec -it ignite-cluster-5 -- du -h /mnt/ignite/data
16.0K /mnt/ignite/data/lost+found
340.0K /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493/metastorage
104.0K /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493/cache-SQL_PUBLIC_EDW_DDS_TICKET
88.0K /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493/cp
4.0K /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493/snp
4.0K /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493/TxLog
60.0K /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493/cache-ignite-sys-cache
608.0K /mnt/ignite/data/node00-3d2f2427-89f9-4950-b9ce-668864d79493
632.0K /mnt/ignite/data
Pod-4: k exec -it ignite-cluster-4 -- du -h /mnt/ignite/data
16.0K /mnt/ignite/data/lost+found
60.0K /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1/cache-ignite-sys-cache
4.0K /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1/snp
104.0K /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1/cache-SQL_PUBLIC_EDW_DDS_TICKET
324.0K /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1/metastorage
4.0K /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1/TxLog
72.0K /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1/cp
576.0K /mnt/ignite/data/node00-4f28bdd5-bd15-484f-860c-fcfa665c63f1
600.0K /mnt/ignite/data
Pod-3: k exec -it ignite-cluster-3 -- du -h /mnt/ignite/data
16.0K /mnt/ignite/data/lost+found
24.0K /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5/cp
316.0K /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5/metastorage
4.0K /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5/TxLog
18.4G /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5/cache-SQL_PUBLIC_EDW_DDS_TICKET
60.0K /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5/cache-ignite-sys-cache
4.0K /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5/snp
18.4G /mnt/ignite/data/node00-1ec48b28-64c0-4dde-9690-2fea32cfb1f5
18.4G /mnt/ignite/data
Pod-2: k exec -it ignite-cluster-2 -- du -h /mnt/ignite/data
16.0K /mnt/ignite/data/lost+found
4.0K /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4/snp
308.0K /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4/metastorage
24.0K /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4/cp
20.7G /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4/cache-SQL_PUBLIC_EDW_DDS_TICKET
4.0K /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4/TxLog
60.0K /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4/cache-ignite-sys-cache
20.7G /mnt/ignite/data/node00-56ad3ba2-6d57-4405-bee9-5e155d2dffd4
20.7G /mnt/ignite/data
Pod-1: k exec -it ignite-cluster-1 -- du -h /mnt/ignite/data
60.0K /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508/cache-ignite-sys-cache
4.0K /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508/TxLog
308.0K /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508/metastorage
4.0K /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508/snp
28.0K /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508/cp
2.1G /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508/cache-SQL_PUBLIC_EDW_DDS_TICKET
2.1G /mnt/ignite/data/node00-b30da2e8-4af9-492b-b15c-3371f5871508
16.0K /mnt/ignite/data/lost+found
2.1G /mnt/ignite/data
Pod-0: k exec -it ignite-cluster-0 -- du -h /mnt/ignite/data
4.0K /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8/TxLog
324.0K /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8/metastorage
60.0K /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8/cache-ignite-sys-cache
4.0K /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8/snp
104.0K /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8/cache-SQL_PUBLIC_EDW_DDS_TICKET
88.0K /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8/cp
592.0K /mnt/ignite/data/node00-a1fbb947-6c8f-44ac-bb2c-7980f2316bb8
16.0K /mnt/ignite/data/lost+found
616.0K /mnt/ignite/data
1 Answer
Using a date as the affinity key tends to be a poor choice, for exactly the reason you have discovered. From the information provided it's impossible to say what the "right" affinity key would be (data modelling is hard), but I can say that if you don't specify one at all, you will get a better distribution.
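For illustration, a sketch of the same DDL without an explicit affinity key — Ignite then computes affinity over the whole primary key, which (assuming helix_uuid is reasonably unique) should spread the partitions far more evenly:

```sql
CREATE TABLE edw_dds_ticket (
..
..
PRIMARY KEY (helix_uuid, ticket_issue_date)
) WITH "TEMPLATE=PARTITIONED,backups=1";
```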