Fishery standard named entity recognition based on BERT+BiLSTM+CRF deep learning model and multivariate combination data augmentation
YANG He, YU Hong, LIU Jusheng, YANG Huining, SUN Zhetao, CHENG Ming, REN Yuan, ZHANG Sijia
1. College of Information Engineering, Key Laboratory of Marine Information Technology of Liaoning Province, Dalian Ocean University, Dalian 116023, China; 2. Key Laboratory of Environment Controlled Aquaculture, Ministry of Education, Dalian 116023, China
Abstract: To address the poor recognition performance caused by the sparse corpus distribution of some entity types in the fishery standard named entity recognition task, a recognition method based on multiple combination data augmentation (MCA) is proposed. The method augments the corpus by combining three algorithms: a joint replacement algorithm based on domain dictionary (DDR), a random deletion algorithm based on slot protection (SPD), and a random insertion algorithm based on slot protection (SPI). First, a same-category word dictionary for "aquatic product name" entities and a domain synonym dictionary are constructed. Using these two dictionaries, "aquatic product name" entities are replaced with same-category words and randomly chosen words are replaced with synonyms, generating new sentences that increase both the number of target entities and the diversity of sentences. Then, under slot protection, random deletion and random insertion are each applied to the original sentences, further enriching the diversity of the corpus while preserving the entities and their context features, and thereby improving the generalization ability of the model. Several groups of comparative experiments were designed to verify the effectiveness of the proposed method. The results showed that fishery standard named entity recognition using the BERT+BiLSTM+CRF network model with a fused attention mechanism together with the MCA method achieved a precision of 91.73%, a recall of 88.64%, and an F1 value of 90.16%. The findings indicate that the MCA-based method effectively alleviates the sparsity of samples for some entity types and improves the overall performance of fishery standard named entity recognition.
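The three augmentation steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes word-level tokens with BIO tags, treats every non-"O" token as a protected slot, and uses hypothetical names (`ddr`, `spd`, `spi`, the `NAME` entity type, the probability `p`, and the toy dictionaries), none of which come from the paper.

```python
import random

def ddr(tokens, tags, entity_type, same_class, synonyms, p, rng):
    """Dictionary-based joint replacement (illustrative assumption):
    swap each entity of `entity_type` for a same-category entry, and
    replace non-entity words that have synonyms with probability p."""
    out_tok, out_tag = [], []
    i = 0
    while i < len(tokens):
        if tags[i] == "B-" + entity_type:
            # Find the end of the entity span.
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + entity_type:
                j += 1
            new = rng.choice(same_class)  # a list of tokens
            out_tok.extend(new)
            out_tag.extend(["B-" + entity_type]
                           + ["I-" + entity_type] * (len(new) - 1))
            i = j
            continue
        tok = tokens[i]
        if tags[i] == "O" and tok in synonyms and rng.random() < p:
            tok = rng.choice(synonyms[tok])
        out_tok.append(tok)
        out_tag.append(tags[i])
        i += 1
    return out_tok, out_tag

def spd(tokens, tags, p, rng):
    """Slot-protected random deletion: drop each non-entity token
    with probability p; entity tokens are always kept."""
    kept = [(tok, tag) for tok, tag in zip(tokens, tags)
            if tag != "O" or rng.random() >= p]
    return [t for t, _ in kept], [g for _, g in kept]

def spi(tokens, tags, n, vocab, rng):
    """Slot-protected random insertion: insert n tokens from `vocab`,
    never at a position that would split an entity span."""
    tokens, tags = list(tokens), list(tags)
    for _ in range(n):
        # Inserting before an "I-" token would break the entity apart.
        allowed = [i for i in range(len(tokens) + 1)
                   if i == len(tokens) or not tags[i].startswith("I-")]
        i = rng.choice(allowed)
        tokens.insert(i, rng.choice(vocab))
        tags.insert(i, "O")
    return tokens, tags

rng = random.Random(0)
tokens = ["turbot", "fry", "should", "be", "healthy"]
tags = ["B-NAME", "I-NAME", "O", "O", "O"]
augmented = [
    ddr(tokens, tags, "NAME", [["carp"]], {"healthy": ["robust"]}, 0.5, rng),
    spd(tokens, tags, 0.3, rng),
    spi(tokens, tags, 1, ["the"], rng),
]
```

Each call returns a new token list and a tag list of the same length, so the augmented sentences can be appended to the training corpus without re-annotation. The deletion and insertion probabilities and the dictionaries are placeholders. The abstract does not state the values or dictionary contents used in the study.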
YANG He, YU Hong, LIU Jusheng, YANG Huining, SUN Zhetao, CHENG Ming, REN Yuan, ZHANG Sijia. Fishery standard named entity recognition based on BERT+BiLSTM+CRF deep learning model and multivariate combination data augmentation. Journal of Dalian Ocean University (大连海洋大学学报), 2021, 36(4): 661-669.