Frontiers in Research Metrics and Analytics

The Termolator: Terminology Recognition Based on Chunking, Statistical and Search-Based Scores

Abstract

The Termolator is an open-source, high-performing terminology extraction system, available on GitHub. The Termolator combines several different approaches to get superior coverage and precision. The in-line term component identifies potential instances of terminology using a chunking procedure, similar to noun group chunking, but favoring chunks that contain out-of-vocabulary words, nominalizations, technical adjectives, and other specialized word classes. The distributional component ranks such term chunks according to several metrics, including: (a) a set of metrics that favors term chunks that are relatively more frequent in a “foreground” corpus about a single topic than they are in a “background” or multi-topic corpus; (b) a well-formedness score based on linguistic features; and (c) a relevance score which measures how often terms appear in articles and patents in a Yahoo web search. We analyse the contributions made by each of these components and show that all modules contribute to the system’s performance, both in terms of the number and quality of terms identified. This paper expands upon previous publications about this research and includes descriptions of some of the improvements made since its initial release. This study also includes a comparison with another terminology extraction system available on-line, Termostat (Drouin 2003). We found that the two systems get comparable results when applied to small amounts of data: about 50% precision for a single foreground file (Einstein’s Theory of Relativity). However, when run with 500 patent files as foreground, Termolator performed significantly better than Termostat. For 500 refrigeration patents, Termolator got 70% precision vs Termostat’s 52%. For 500 semiconductor patents, Termolator got 79% precision vs Termostat’s 51%.
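
The foreground/background idea in metric (a) can be illustrated with a minimal Python sketch. This is not the Termolator's actual scoring code: the function name `rank_by_foreground_ratio`, the add-one smoothing, and the toy term lists are all assumptions made for illustration. The abstract only says that the system favors chunks that are relatively more frequent in the foreground corpus than in the background corpus.

```python
from collections import Counter


def rank_by_foreground_ratio(foreground_terms, background_terms, smoothing=1.0):
    """Rank candidate terms by a smoothed foreground/background frequency ratio.

    Hypothetical helper for illustration only; the Termolator combines
    several metrics that are not reproduced here.
    """
    fg = Counter(foreground_terms)
    bg = Counter(background_terms)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab_size = len(set(fg) | set(bg))

    scores = {}
    for term, count in fg.items():
        # Additive smoothing keeps the ratio finite for terms that never
        # occur in the background corpus.
        fg_rate = (count + smoothing) / (fg_total + smoothing * vocab_size)
        bg_rate = (bg[term] + smoothing) / (bg_total + smoothing * vocab_size)
        scores[term] = fg_rate / bg_rate

    # Highest ratio first: terms most characteristic of the foreground.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (assumed): candidate chunks from a single-topic foreground corpus
# and from a multi-topic background corpus.
foreground = ["heat exchanger", "heat exchanger", "compressor", "system", "system"]
background = ["system", "system", "system", "method", "compressor", "result"]

ranking = rank_by_foreground_ratio(foreground, background)
for term, score in ranking:
    print(f"{term}: {score:.3f}")
```

In this toy example, "heat exchanger" ranks first because it never occurs in the background list, while the generic word "system" ranks last even though it is equally frequent in the foreground.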
