FUSION: Feature-based Processing of Heterogeneous Documents for Automated Information Extraction
- Information Extraction (IE) processes are often business-critical but very hard to automate because the underlying data is heterogeneous. Specific document characteristics, also called features, influence the optimal way of processing. The Architecture for Automated Generation of Distributed Information Extraction Pipelines (ARTIFACT) supports businesses in successively automating their IE processes by finding optimal IE pipelines. However, ARTIFACT treats every document the same way and does not enable document-specific processing. Single solution strategies can perform extraordinarily well for documents with particular traits. Although manual approvals are superfluous for these documents, ARTIFACT provides no opportunity for Fully Automatic Processing (FAP). Therefore, we introduce an enhanced pattern that integrates an extensible and domain-independent concept of feature detection based on microservices. This yields two fundamental benefits. First, document-specific processing increases the quality of automatically generated IE pipelines. Second, the system enables FAP, eliminating superfluous approval effort.
Author: | Michael Sildatke, Hendrik Karwanni, Bodo Kraft, Albert Zündorf |
---|---|
DOI: | https://doi.org/10.5220/0011351100003266 |
ISBN: | 978-989-758-588-3 |
ISSN: | 2184-2833 |
Parent Title (English): | Proceedings of the 17th International Conference on Software Technologies - ICSOFT |
Document Type: | Conference Proceeding |
Language: | English |
Year of Completion: | 2022 |
First Page: | 250 |
Last Page: | 260 |
Note: | 17th International Conference on Software Technologies, July 11-13, 2022, in Lisbon, Portugal |
Link: | https://www.scitepress.org/Link.aspx?doi=10.5220/0011351100003266 |
Institutes: | FH Aachen / Fachbereich Energietechnik; FH Aachen / Fachbereich Medizintechnik und Technomathematik |