Running a Java XML parser with multiple Erlang processes

cs7cruho  posted on 2022-12-08  in Erlang

I have a project in a concurrent and distributed programming course.
In this course we use Erlang.
I need to use a database stored in an XML file, which already has a parser written in Java (this is the link for the XML and the parser: https://dblp.org/faq/1474681.html ). The XML file is 2.5GB, so I understand that the first step is to create a number of Erlang processes, each of which will parse one chunk of the XML.
The thing is that this is the first time I'm doing something like this (combining Erlang and Java, and parsing a really big XML file), so I'm not sure how to approach the problem. Should I divide the XML into chunks before I start parsing it? Or somehow set a start and end position for each process that parses the XML?
Just to clarify: the course is about Erlang and using processes in Erlang, so I must use it (even though I'm sure there are Java multi-threading solutions).
I will really appreciate any ideas or help! Thanks!


kxkpmulp #1

You can do it in Erlang without using Java, and you do not need to read the whole file before processing it. Use an XML parser that supports a streaming API. I recommend fast_xml, which is very fast (it uses C functions to parse XML). After initializing the stream parser state, read the file chunk by chunk in a loop (a recursive function), for example 1024 bytes per chunk, and pass each chunk to the parser. Whenever the parser completes a new XML element, it sends that element to your callback process as an Erlang message. In your callback process you can spawn more processes to work on each XML element.
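The loop described above might look roughly like the following. This is only a sketch, assuming fast_xml is available as a dependency: the module name `dblp_stream`, the `handle_element/1` helper, and the 64 KB chunk size are made up for illustration, and the code does not deal with any DTD entities the DBLP file may rely on.

```erlang
-module(dblp_stream).
-export([run/1]).

%% Bytes read per step (an assumption; the answer suggests 1024).
-define(CHUNK, 65536).

%% Stream File through fast_xml and return once the root element closes.
run(File) ->
    {ok, _} = application:ensure_all_started(fast_xml),
    Parent = self(),
    %% The collector is the "callback process" that receives parser events.
    Collector = spawn_link(fun() -> collect(Parent, 0) end),
    {ok, Fd} = file:open(File, [read, raw, binary]),
    Stream = fxml_stream:new(Collector),
    feed(Fd, Stream),
    receive
        {done, Count} -> {ok, Count};
        {error, Reason, Count} -> {error, Reason, Count}
    end.

%% Recursive read loop: hand each chunk to the parser, keep the new state.
feed(Fd, Stream) ->
    case file:read(Fd, ?CHUNK) of
        {ok, Data} ->
            feed(Fd, fxml_stream:parse(Stream, Data));
        eof ->
            fxml_stream:close(Stream),
            file:close(Fd)
    end.

%% Receive parser events; spawn one worker per top-level element.
collect(Parent, Count) ->
    receive
        {'$gen_event', {xmlstreamelement, El}} ->
            spawn(fun() -> handle_element(El) end),
            collect(Parent, Count + 1);
        {'$gen_event', {xmlstreamend, _Root}} ->
            Parent ! {done, Count};
        {'$gen_event', {xmlstreamerror, Reason}} ->
            Parent ! {error, Reason, Count};
        {'$gen_event', _Other} ->
            %% Root start tag and any other events are ignored here.
            collect(Parent, Count)
    end.

%% Hypothetical per-element work; replace with the real processing.
handle_element({xmlel, Name, _Attrs, _Children}) ->
    Name.
```

Calling `dblp_stream:run("dblp.xml")` from a shell would then return `{ok, Count}` with the number of top-level elements seen. Spawning one process per element keeps the sketch short; for a 2.5GB file, a fixed pool of workers would probably be kinder to memory.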
