Potential of natural language processing for metadata extraction from environmental scientific publications

Blanchy, Guillaume; Albrecht, Lukas; Koestel, John; Garre, Sarah

doi:10.5194/soil-9-155-2023

Abstract

Summarizing information from large bodies of scientific literature is an essential but work-intensive task. This is especially true in environmental studies, where multiple factors (e.g., soil, climate, vegetation) can contribute to the effects observed. Meta-analyses, studies that quantitatively summarize findings of a large body of literature, rely on manually curated databases built upon primary publications. However, given the increasing amount of literature, this manual work is likely to require more and more effort in the future. Natural language processing (NLP) facilitates this task, but it is not yet clear to what extent the extraction process is reliable or complete. In this work, we explore three NLP techniques that can help support this task: topic modeling, tailored regular expressions and the shortest dependency path method. We apply these techniques in a practical and reproducible workflow on two corpora of documents: the Open Tension-disk Infiltrometer Meta-database (OTIM) and the Meta corpus. The OTIM corpus contains the source publications of the entries of the OTIM database of near-saturated hydraulic conductivity from tension-disk infiltrometer measurements (https://github.com/climasoma/otim-db, last access: 1 March 2023). The Meta corpus consists of all primary studies from 36 selected meta-analyses on the impact of agricultural practices on sustainable water management in Europe. As a first step of our practical workflow, we identified different topics from the individual source publications of the Meta corpus using topic modeling. This enabled us to distinguish well-researched topics (e.g., conventional tillage, cover crops), where meta-analysis would be useful, from neglected topics (e.g., effect of irrigation on soil properties), showing potential knowledge gaps. Then, we used tailored regular expressions to extract coordinates, soil texture, soil type, rainfall, disk diameter and tensions from the OTIM corpus to build a quantitative database.
We were able to retrieve between 56 % and 100 % of all relevant information (recall), with a precision between 83 % and 100 %. Finally, we extracted relationships between a set of drivers corresponding to different soil management practices or amendments (e.g., "biochar", "zero tillage") and target variables (e.g., "soil aggregate", "hydraulic conductivity", "crop yield") from the source publications' abstracts of the Meta corpus using the shortest dependency path between them. These relationships were further classified according to positive, negative or absent correlations between the driver and the target variable. This quickly provided an overview of the different driver-variable relationships and their abundance for an entire body of literature. Overall, we found that all three tested NLP techniques were able to support evidence synthesis tasks. While human supervision remains essential, NLP methods have the potential to support automated evidence synthesis which can be continuously updated as new publications become available.
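To make the regular-expression step and its evaluation concrete, the following is a minimal sketch in Python. It is not the authors' code: the pattern, the function names (extract_coordinates, precision_recall) and the sample text are assumptions made for illustration, since the abstract does not give the expressions used in the study. The sketch extracts decimal-degree coordinates from text and scores them against a manually curated reference set, which is how recall and precision are usually computed.

```python
import re

# Hypothetical pattern for decimal-degree coordinates such as "59.82° N, 17.65° E".
# The expressions used in the study are not given in the abstract.
COORD_RE = re.compile(
    r"(\d{1,3}(?:\.\d+)?)\s*°\s*([NS])\s*[,;]?\s*(\d{1,3}(?:\.\d+)?)\s*°\s*([EW])"
)


def extract_coordinates(text):
    """Return a set of (latitude, longitude) tuples in signed decimal degrees."""
    found = set()
    for lat, ns, lon, ew in COORD_RE.findall(text):
        lat_val = float(lat) * (1 if ns == "N" else -1)
        lon_val = float(lon) * (1 if ew == "E" else -1)
        found.add((lat_val, lon_val))
    return found


def precision_recall(extracted, expected):
    """Score extracted items against a manually curated reference set."""
    true_pos = len(extracted & expected)
    # Precision: share of extracted items that are correct.
    precision = true_pos / len(extracted) if extracted else 0.0
    # Recall: share of reference items that were found.
    recall = true_pos / len(expected) if expected else 0.0
    return precision, recall


# Invented sample text, for illustration only.
sample = (
    "The field site is located at 59.82° N, 17.65° E. "
    "A second plot lies near 47.43°N 8.52°E, "
    "and a third at latitude 50.56 north, longitude 4.69 east."
)
# Invented reference values standing in for a manually curated database.
reference = {(59.82, 17.65), (47.43, 8.52), (50.56, 4.69)}

extracted = extract_coordinates(sample)
precision, recall = precision_recall(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this invented sample, the third location is written out in words, so the pattern misses it: everything extracted is correct, but only two of the three reference entries are found. This mirrors the kind of gap between precision and recall that the abstract reports, and it shows why rule-based extraction of this kind still needs human checking.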

Published in

Soil
2023, volume: 9, number: 1, pages: 155-168
Publisher: COPERNICUS GESELLSCHAFT MBH

SLU authors

Koestel, Johannes
- Department of Soil and Environment, Swedish University of Agricultural Sciences
- Agroscope

UKÄ research subject

Soil Science

Publication identifier

DOI: https://doi.org/10.5194/soil-9-155-2023

Permanent link to this page (URI)

https://res.slu.se/id/publ/121887