Recognition and application of geological entities related to ore-forming conditions in the Kaiyang phosphate mine based on the XLNET model
Keywords: geological entity recognition; XLNET-BILSTM-Attention-CRF; phosphate metallogenic model; pretrained model; sequence labelling
Abstract: Objective As phosphate ore prospecting becomes more difficult, the number of geological exploration reports keeps growing. Manually identifying geological information related to phosphate mineralization in massive document collections is time-consuming and inefficient, and it cannot meet the needs of knowledge sharing, dissemination and intelligent management of geological reports.
Methods To quickly extract the ore-forming geological knowledge hidden in phosphate ore reports, this work establishes an automatic recognition method for ore-forming geological entities based on the XLNET model. First, entities were annotated with BIO labels to build a geological entity dictionary, and XLNET was used as the underlying pretrained model to learn the bidirectional semantics of sentences. Then, a BILSTM-Attention-CRF model (bidirectional long short-term memory (BILSTM), self-attention layer (Attention), conditional random field (CRF)) was used to classify the text into multiple labels. Finally, the ore-forming conditions and metallogenic model of the phosphate ore were roughly inferred by locating where phosphate ore entities occur in the reports.
Results Compared with the other three models, the precision (P), recall (R) and F1 value of this model are all close to 90%, and its F1 value is about 2, 5 and 6 percentage points higher than those of XLNET-BiLSTM-CRF, BERT-BiLSTM-CRF and BiLSTM-CRF, respectively (Table 7, 800 iterations).
Conclusion This study provides geological researchers at the Kaiyang phosphate mine with a more efficient method for automatic geological entity recognition.
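The BIO labelling step described in the Methods can be illustrated with a minimal sketch. It assumes character-level tagging (a common choice for Chinese text; the source does not state the granularity) and uses the first example sentence from Table 8 with the labels KD and KJ. The function name `bio_tags` and the non-overlapping, exact-match span policy are assumptions for illustration, not the authors' annotation tool.

```python
from typing import List, Tuple


def bio_tags(sentence: str, entities: List[Tuple[str, str]]) -> List[str]:
    """Return one BIO tag per character.

    `entities` holds (surface text, label) pairs; every exact occurrence of
    the surface text is tagged unless it overlaps an already tagged span.
    """
    tags = ["O"] * len(sentence)
    for text, label in entities:
        if not text:
            continue  # an empty surface string cannot be located
        start = sentence.find(text)
        while start != -1:
            end = start + len(text)
            # Tag the span only if none of its characters is labelled yet.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = f"B-{label}"
                for i in range(start + 1, end):
                    tags[i] = f"I-{label}"
            start = sentence.find(text, end)
    return tags


# Example 1 from Table 8: an ore block (KD) and its spatial position (KJ).
sentence = "两岔河矿段位于洋水背斜西翼中北部"
entities = [("两岔河矿段", "KD"), ("洋水背斜西翼中北部", "KJ")]

for char, tag in zip(sentence, bio_tags(sentence, entities)):
    print(char, tag)
```

Each character receives `B-` at the start of an entity, `I-` inside it, and `O` elsewhere, which is the sequence a tagger such as BILSTM-Attention-CRF would be trained to predict.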
Table 1. Phosphate ore entity labels

| Label | Entity type |
| --- | --- |
| KQ | Mining area |
| KC | Ore deposit |
| KD | Ore block |
| KT | Ore body |
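Table 2. Phosphate ore entity attribute labels

| Label | Attribute type |
| --- | --- |
| LX | Type |
| GM | Scale |
| SJ | Time |
| KJ | Space |
| DC | Stratum (or fault) |
| KKGZ | Ore-controlling structure |
| CZ | Occurrence |
| XT | Morphology |
| SBLX | Alteration type |
| DLGM | Quantitative scale |
| FYCD | Degree of development |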
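Table 3. Phosphate ore entity semantic relationship labels

| Label | Entity types involved |
| --- | --- |
| R1 (time) | The 4 entity types and time |
| R2 (containment) | Among the 4 entity types |
| R3 (type and scale) | The 4 entity types and type, scale |
| R4 (space) | The 4 entity types and space |
| R5 (basic characteristics) | The 4 entity types and occurrence, morphology, alteration type, quantitative scale, degree of development |
| R6 (host position) | The 4 entity types and stratum (or fault), ore-controlling structure |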
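Table 4. Phosphate ore entity attribute relationship annotation example

| Entity | Entity or attribute | Relation type |
| --- | --- | --- |
| Yangshui mining area (KQ) | West limb (KJ) | R4 |
| Yangshui mining area (KQ) | Shallow-marine to littoral sedimentary phosphorite deposit (KC) | R2 |
| Yangshui mining area (KQ) | Uplift (KKGZ) | R6 |
| Yangshui mining area (KQ) | Depth 30~500 m (DLGM) | R5 |
| Yangshui mining area (KQ) | Nantuo glaciation (SJ) | R1 |
| Yangshui mining area (KQ) | Transgression (SBLX) | R5 |
| Yangshui mining area (KQ) | Stratiform (XT) | R5 |
| Yangshui mining area (KQ) | Ore body No. Ⅴ (KT) | R2 |
| Yangshui mining area (KQ) | Ore body No. Ⅵ (KT) | R2 |
| Ore body No. Ⅴ (KT) | F11 (DC) | R6 |
| Ore body No. Ⅴ (KT) | Footwall (KJ) | R4 |
| … | … | … |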
Table 5. Experimental environment settings

| Operating system | CPU | Memory | Disk capacity | GPU |
| --- | --- | --- | --- | --- |
| Linux | AMD EPYC 7543 | 80 GB | 80 GB | A40 |
Table 6. Model parameter list

| LSTM hidden dimension | Maximum sequence length | Batch size | AdamW learning rate | Dropout |
| --- | --- | --- | --- | --- |
| 1024 | 128 | 36 | 3×10⁻⁵ | 0.5 |
Table 7. Geological entity recognition performance of different models

| Iterations | Model | Precision P/% | Recall R/% | F1 value/% |
| --- | --- | --- | --- | --- |
| 400 | BiLSTM-CRF | 76.41 | 73.74 | 75.05 |
| 400 | BERT-BiLSTM-CRF | 79.52 | 78.01 | 78.76 |
| 400 | XLNET-BiLSTM-CRF | 82.38 | 83.29 | 82.32 |
| 400 | XLNET-BiLSTM-Attention-CRF | 85.75 | 86.91 | 86.32 |
| 800 | BiLSTM-CRF | 84.59 | 83.11 | 83.85 |
| 800 | BERT-BiLSTM-CRF | 85.43 | 84.23 | 84.83 |
| 800 | XLNET-BiLSTM-CRF | 87.51 | 88.73 | 88.12 |
| 800 | XLNET-BiLSTM-Attention-CRF | 89.03 | 90.86 | 89.94 |
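Table 8. Geological entity recognition results

| Example | Original text | Manually annotated entities | Model-recognized entities |
| --- | --- | --- | --- |
| 1 | 两岔河矿段位于洋水背斜西翼中北部 (The Liangchahe ore block lies in the north-central part of the west limb of the Yangshui anticline) | KD: “两岔河矿段”; KJ: “洋水背斜西翼中北部” | KD: “两岔河矿段”; KJ: “洋水背斜西翼中北部” |
| 2 | 目前该矿脉已探明Ⅰ号、Ⅱ号、Ⅲ号矿体 (Ore bodies No. Ⅰ, Ⅱ and Ⅲ have so far been proven in this vein) | KT: “Ⅰ号”, “Ⅱ号”, “Ⅲ号” | KT: “Ⅰ号”, “Ⅱ号”, “Ⅲ号”矿体 |
| 3 | 工作区出露地层主要为青白口系鹅家坳组 (The strata exposed in the work area are mainly the Ejia'ao Formation of the Qingbaikou System) | DC: “青白口系鹅家坳组” | DC: “青白口系鹅家坳组” |
| 4 | 贵州省开阳磷矿洋水矿区两岔河矿段(南段)磷矿勘探 (Phosphate ore exploration of the Liangchahe ore block (southern section), Yangshui mining area, Kaiyang phosphate mine, Guizhou Province) | KT: “磷矿”; KQ: “洋水矿区”; KD: “两岔河矿段” | KT: “磷矿”; KQ: “洋水矿区”; KD: “两岔河矿段” |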
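Table 9. Geological entity recognition results of the model after expanding the training set

| Training set size | P/% | R/% | F1 value/% |
| --- | --- | --- | --- |
| Over 50,000 characters | 91.36 | 94.67 | 92.99 |
| Over 70,000 characters | 93.42 | 95.36 | 94.38 |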
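The F1 values in Table 9 can be recomputed from the reported precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean, F1 = 2PR/(P + R); the source does not spell out the formula, and `f1_score` is a hypothetical helper name.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (result is in the input units)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Rows of Table 9: (training set size, P/%, R/%)
table9 = [
    ("over 50,000 characters", 91.36, 94.67),
    ("over 70,000 characters", 93.42, 95.36),
]

for size, p, r in table9:
    print(f"{size}: F1 = {f1_score(p, r):.2f}%")
```

Under this assumption the two rows give 92.99% and 94.38%, matching the F1 column of Table 9.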