I have a project in a concurrent and distributed programming course.
In this course we use Erlang.
I need to use a database stored in an XML file, for which a parser has already been written in Java (this is the link for the XML and the parser: https://dblp.org/faq/1474681.html ). The XML file is 2.5GB, so I understand that the first step is to create a number of Erlang processes, each of which will parse one chunk of the XML.
The thing is that this is the first time I'm doing something like this (combining Erlang and Java, and parsing a really big XML file), so I'm not sure how to approach the problem. Should I divide the XML into chunks before I start to parse it? Or somehow set a start and end position for each process that parses the XML?
Just to clarify: the course is about Erlang and using processes in Erlang, so I must use it (I'm sure there are Java multi-threading solutions, but they are not an option here).
I will really appreciate any ideas or help! Thanks!
1 Answer
You can do this in Erlang without using Java, and you do not need to read the whole file before processing it. Use an XML parser that supports a streaming API. I recommend fast_xml, which is very fast (it uses C functions to parse XML). After initializing the stream parser state, read the file chunk by chunk in a loop (a recursive function), for example 1024 bytes per chunk, and pass each chunk to the parser. Whenever the parser completes a new XML element, it sends that element to your callback process as an Erlang message. In your callback process, you can spawn more processes to work on each XML element.
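The steps above could be sketched roughly as follows. This is a minimal sketch, not a tested solution for dblp.xml: the module name `dblp_stream`, the functions `run/1` and `run/2`, the `HandleFun` argument and the 64 KiB chunk size are all made up for illustration. The `fxml_stream:new/1`, `fxml_stream:parse/2` and `fxml_stream:close/1` calls and the `'$gen_event'` message shapes follow the fast_xml README as far as I know. Note that dblp.xml relies on character entities defined in its DTD, which may need extra handling.

```erlang
-module(dblp_stream).
-export([run/1, run/2]).

%% Hypothetical chunk size; the 1024 bytes suggested above works too.
-define(CHUNK_SIZE, 65536).

%% Parse the file at Path and do nothing with each element (just count them).
run(Path) ->
    run(Path, fun(_El) -> ok end).

%% Parse the file at Path, calling HandleFun(Element) in a fresh process for
%% every direct child of the root element (for DBLP: article, inproceedings...).
%% Returns {ok, Count} or {error, Reason}.
run(Path, HandleFun) ->
    {ok, _} = application:ensure_all_started(fast_xml),
    Parent = self(),
    %% The callback process that receives the parser's messages.
    Collector = spawn_link(fun() -> collect(Parent, HandleFun, 0) end),
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    Stream = fxml_stream:new(Collector),
    try
        fxml_stream:close(feed(Fd, Stream))
    after
        file:close(Fd)
    end,
    %% Assumption: events emitted during parse/2 reach the collector before
    %% this message. It unblocks us if the root element was never closed.
    Collector ! eof,
    receive
        {done, Collector, Result} -> Result
    end.

%% Read the file chunk by chunk and hand each chunk to the stream parser.
%% A read error crashes with case_clause, which is acceptable for a sketch.
feed(Fd, Stream) ->
    case file:read(Fd, ?CHUNK_SIZE) of
        {ok, Chunk} -> feed(Fd, fxml_stream:parse(Stream, Chunk));
        eof -> Stream
    end.

%% Callback process: one worker process is spawned per completed element.
collect(Parent, HandleFun, Count) ->
    receive
        {'$gen_event', {xmlstreamelement, El}} ->
            spawn(fun() -> HandleFun(El) end),
            collect(Parent, HandleFun, Count + 1);
        {'$gen_event', {xmlstreamstart, _Name, _Attrs}} ->
            collect(Parent, HandleFun, Count);
        {'$gen_event', {xmlstreamend, _Name}} ->
            Parent ! {done, self(), {ok, Count}};
        {'$gen_event', {xmlstreamerror, Reason}} ->
            Parent ! {done, self(), {error, Reason}};
        eof ->
            Parent ! {done, self(), {error, no_stream_end}}
    end.
```

With this layout the file is read and parsed sequentially by one process, and the concurrency comes from the workers that handle each element. Spawning one process per element is unbounded, so for millions of records a fixed pool of workers fed by the collector is probably the safer design.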