Extracting information from PDF documents for use in the automatic indexing of e-books

Authors

I. Gil-Leiva, M. S. L. Fujita, F. M. Redigolo, J. F. Saran

Keywords:

Software evaluation, GROBID, Automatic indexing, PDFMiner.six, PDFAct, PDF-extract, PDFExtract.

Abstract

The number of e-books entering libraries in PDF format grows every day. This complicates some processes that librarians have
traditionally carried out by hand, such as subject assignment, and makes them almost unfeasible. In this context, applications
that assist librarians need to be designed and developed. With this in mind, this paper evaluates tools for extracting
information from PDF books, whose output could later serve as raw material for an automatic indexing system. To that end, we
first evaluated five software tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract and GROBID). Since PDFAct performed best,
we then ran a second evaluation to determine its ability to identify and extract information from books, such as titles,
tables of contents, sections, table and figure captions, and bibliographic references, all of which is relevant to any
indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although
PDFAct outperformed the rest.
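
The abstract does not say how extraction quality was scored. As a rough illustration only, the sketch below assumes that each book part (title, section heading, and so on) returned by a tool is compared with a hand-prepared ground truth using a string similarity ratio from Python's standard `difflib` module. The function names, the part labels, and the sample data are hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher
from typing import Dict


def similarity(expected: str, extracted: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and repeated whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return SequenceMatcher(None, normalize(expected), normalize(extracted)).ratio()


def score_tool(ground_truth: Dict[str, str], extracted: Dict[str, str]) -> Dict[str, float]:
    """Score every ground-truth part; a part the tool did not return is compared
    with an empty string, which yields 0.0 for any non-empty expected text."""
    return {
        part: similarity(text, extracted.get(part, ""))
        for part, text in ground_truth.items()
    }


# Hypothetical ground truth and tool output for a single book.
ground_truth = {"title": "Manual de indización", "section": "1. Introducción"}
tool_output = {"title": "Manual  de Indización"}  # the section heading was missed

scores = score_tool(ground_truth, tool_output)
```

Under these assumptions, averaging the per-part scores over a set of books would give one figure per tool that could be compared across tools; a real evaluation would also need to decide how to treat text assigned to the wrong part.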

Published

23-09-2022

How to Cite

Gil-Leiva, I., Fujita, M. S. L., Redigolo, F. M., & Saran, J. F. (2022). Extracción de información de documentos PDF para su uso en la indización automática de e-books. Transinformação, 34, 1–11. Retrieved from https://puccampinas.emnuvens.com.br/transinfo/article/view/6870

Issue

Section

Original Articles