
Supervised Term Weighting for Automated Text Categorization

东方龙头 posted on 2020-12-31 20:24:24
“Text categorization (TC) is the activity of automatically building, by means of machine learning (ML) techniques, automatic text classifiers, i.e. programs capable of labelling natural language texts from a domain D with thematic categories from a predefined set C = {c1, …, c|C|} [10]. The construction of an automatic text classifier relies on the existence of an initial corpus D = {d1, …, d|D|} of documents preclassified under C. A general inductive process (called the learner) automatically builds a classifier for C by learning the characteristics of C from a training set Tr = {d1, …, d|Tr|} of documents. Once a classifier has been built, its effectiveness (i.e. its capability to take the right categorization decisions) may be tested by applying it to the test set Te = D - Tr and checking the degree of correspondence between the decisions of the classifier and those encoded in the corpus. This is called a supervised learning activity, since learning is “supervised” by the information on the membership of training documents in categories.”
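As a concrete rendering of this setup, here is a minimal sketch assuming scikit-learn; the toy corpus, the two categories, and the choice of a linear SVM as the learner are illustrative assumptions of ours, not anything prescribed in the quoted text.

# Minimal sketch of supervised text categorization with a train/test split.
# The toy documents and category labels below are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

docs = [
    "the striker scored twice in the second half",
    "the central bank raised interest rates again",
    "the goalkeeper saved a late penalty",
    "stocks fell after the inflation report",
    "the midfielder was booked for a hard tackle",
    "the company reported record quarterly earnings",
]
labels = ["sport", "finance", "sport", "finance", "sport", "finance"]  # categories C

# Split the preclassified corpus into a training set Tr and a test set Te = D - Tr.
docs_tr, docs_te, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.33, random_state=0, stratify=labels)

# Document indexing: internal representations (plain unsupervised tf-idf here).
vectorizer = TfidfVectorizer()
X_tr = vectorizer.fit_transform(docs_tr)
X_te = vectorizer.transform(docs_te)

# Classifier learning from the internal representations of Tr.
classifier = LinearSVC()
classifier.fit(X_tr, y_tr)

# Effectiveness: correspondence between the classifier's decisions on Te and
# the labels encoded in the corpus.
print("micro-F1 on Te:", f1_score(y_te, classifier.predict(X_te), average="micro"))
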
“The construction of a text classifier may be seen as consisting of essentially two phases:
1. document indexing, i.e. the creation of internal representations for documents. This typically consists in
(a) term selection, consisting in the selection, from the set T (that contains all the terms that occur in the documents of Tr), of the subset T’ ⊂ T of terms that, when used as dimensions for document representation, are expected to yield the best effectiveness; and
(b) term weighting, in which, for every term tk selected in phase (1a) and for every document dj, a weight 0 ≤ wkj ≤ 1 is computed which represents, loosely speaking, how much term tk contributes to the discriminative semantics of document dj;
2. a phase of classifier learning, i.e. the creation of a classifier by learning from the internal representations of the training documents.”

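The two indexing steps in the quoted list can be sketched as follows, again assuming scikit-learn; the chi-square statistic as the scoring function f(tk, ci), the cut-off k, and the function name index_documents are our illustrative choices, not fixed by the quoted text. With cosine (L2) normalisation every weight wkj falls in [0, 1], as the definition above requires.

# Sketch of phase 1 (document indexing): (1a) term selection driven by a
# category-based score, then (1b) conventional unsupervised term weighting.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2

def index_documents(docs_tr, y_tr, k=500):
    """Return tf-idf representations of docs_tr over the selected term subset T'."""
    # Build the full term set T from the documents of Tr.
    count_vec = CountVectorizer()
    counts = count_vec.fit_transform(docs_tr)

    # (1a) term selection: score every term tk with f(tk, ci) (chi-square here)
    # and keep the k highest-scoring terms as T'.
    selector = SelectKBest(chi2, k=min(k, counts.shape[1]))
    counts_sel = selector.fit_transform(counts, y_tr)

    # (1b) term weighting: a weight 0 <= wkj <= 1 for every selected term tk
    # and every document dj (the usual cosine-normalised tf-idf at this stage).
    weights = TfidfTransformer(norm="l2").fit_transform(counts_sel)
    return weights, count_vec, selector
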
“Traditionally, supervised learning affects only phases (1a) and (2). In this paper we propose instead that supervised learning is used also in phase (1b), so as to make the weight wkj reflect the importance of term tk in deciding the membership of dj to the categories of interest. We call this idea supervised term weighting (STW).
Concerning the computation of term weights, we propose that phase (1b) capitalizes on the results of phase (1a), since the selection of the best terms is usually accomplished by scoring each term tk by means of a function f(tk, ci) that measures its capability to discriminate category ci, and then selecting the terms that maximize f(tk, ci). In our proposal the f(tk, ci) scores are not discarded after term selection, but become an active ingredient of the term weight.”

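A simple way to render this idea in code, under the same scikit-learn assumption: the chi-square scores computed for term selection are kept and used as the collection-level factor of the weight, i.e. tf x chi-square in place of tf x idf. Collapsing f(tk, ci) to one global score per term and the function name stw_tf_chi2 are simplifications of ours, not the paper's exact experimental setting.

# Sketch of supervised term weighting (STW): reuse the term-selection scores
# as part of the weight instead of discarding them.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from sklearn.preprocessing import normalize

def stw_tf_chi2(docs_tr, y_tr):
    """tf x chi-square weighting: a simple global form of STW."""
    count_vec = CountVectorizer()
    tf = count_vec.fit_transform(docs_tr)   # term frequencies tf(tk, dj)

    # The f(tk, ci) scores computed during term selection are not discarded;
    # they replace idf as the collection-level factor of the weight.
    chi2_scores, _ = chi2(tf, y_tr)         # one global score per term tk

    # wkj proportional to tf(tk, dj) * chi2(tk); cosine-normalise each document
    # so that 0 <= wkj <= 1.
    weights = tf.multiply(chi2_scores)
    return normalize(weights, norm="l2"), count_vec
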
“The TC literature discusses two main policies to perform term selection: (a) a local policy, where different sets of terms T’ ⊂ T are selected for different categories ci, and (b) a global policy, where a single set of terms T’ ⊂ T is selected by extracting a single score fglob(tk) from the individual scores f(tk, ci). In this paper we experiment with both policies, but always using the same policy for both term selection and term weighting. A consequence of adopting the local policy and reusing the scores for term weighting is that weights, traditionally a function of a term tk and a document dj, now also depend on a category ci; this means that, in principle, the representation of a document is no more a vector of |T’| terms, but a set of vectors of T’i terms, with i = 1, …, |C|.”
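Both policies can be made concrete in a short sketch, assuming scikit-learn and numpy; the one-vs-rest chi-square score as f(tk, ci) and the max / frequency-weighted-sum globalisation functions are common choices used here purely for illustration. Under the local policy, a document would then be indexed once per category-specific term set T’i, as the passage above notes.

# Sketch of the two term-selection policies: local (one score f(tk, ci) per
# category) versus global (a single fglob(tk) collapsed from the local scores).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

def local_and_global_scores(docs_tr, y_tr):
    counts = CountVectorizer().fit_transform(docs_tr)
    categories = sorted(set(y_tr))

    # Local policy: score every term tk against each category ci (one-vs-rest),
    # giving a |T| x |C| matrix of f(tk, ci) values.
    f_local = np.column_stack([
        chi2(counts, [1 if y == ci else 0 for y in y_tr])[0]
        for ci in categories
    ])

    # Global policy: collapse the per-category scores into a single fglob(tk),
    # e.g. the maximum over categories or a sum weighted by category frequency.
    p_ci = np.array([np.mean([y == ci for y in y_tr]) for ci in categories])
    f_glob_max = f_local.max(axis=1)
    f_glob_wsum = f_local @ p_ci
    return f_local, f_glob_max, f_glob_wsum
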
References:
[10] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.

Source: https://blog.csdn.net/qq_33790600/article/details/111957807