Extracting information from PDF documents for use in automatic indexing of e-books
Keywords:
Software evaluation, PDFMiner.six, PDFAct, PDF-extract, PDFExtract, Grobid, Automatic indexing
Abstract
The number of electronic books entering libraries in PDF format grows every day, which complicates, and makes almost unfeasible, some processes traditionally carried out manually by librarians, such as the assignment of subjects. In this context, it is necessary to design and develop applications that assist librarians. With this in mind, we present an evaluation of tools for extracting information from books in PDF format that could later serve as raw material for an automatic indexing system. We first evaluated five tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract, and Grobid). Because PDFAct achieved the best performance, we then carried out a second evaluation of its ability to identify and extract information from the books, such as titles, indexes, sections, titles of tables and graphs, and bibliographic references, all of which are relevant to any indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although PDFAct performed better than the rest.
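As an illustration of how such an evaluation could be scored, the minimal sketch below compares the text a tool extracted for each part of a book against a manually prepared reference. The abstract does not state which metric was used, so the similarity ratio from Python's standard `difflib` module is an assumption made here, and the function names and part labels (`title`, `toc`) are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def similarity(extracted: str, expected: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, normalize(extracted), normalize(expected)).ratio()


def score_tool(extracted: dict[str, str], ground_truth: dict[str, str]) -> dict[str, float]:
    """Score every part listed in the ground truth.

    A part the tool failed to extract is compared against an empty
    string, so it scores 0.0 (assuming the reference text is not empty).
    """
    return {
        part: similarity(extracted.get(part, ""), expected)
        for part, expected in ground_truth.items()
    }


if __name__ == "__main__":
    # Hypothetical data: one book, two parts, one of them missed by the tool.
    truth = {"title": "Manual de indización", "toc": "1. Introducción 2. Métodos"}
    output = {"title": "Manual  de  Indización"}
    print(score_tool(output, truth))
```

Scoring each part separately, rather than the whole document at once, makes it possible to see which parts (titles, sections, references) a given tool handles well and which it misses.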