{"id":203,"date":"2016-12-12T21:41:10","date_gmt":"2016-12-12T20:41:10","guid":{"rendered":"http:\/\/liacs.leidenuniv.nl\/~kraaijw\/?p=203"},"modified":"2017-01-17T20:46:51","modified_gmt":"2017-01-17T19:46:51","slug":"evaluation-analysis-term-scoring-methods-term-extraction","status":"publish","type":"post","link":"https:\/\/liacs.leidenuniv.nl\/~kraaijw\/index.php\/2016\/12\/12\/evaluation-analysis-term-scoring-methods-term-extraction\/","title":{"rendered":"Evaluation and analysis of term scoring methods for term extraction"},"content":{"rendered":"<div class=\"MainTitleSection\">\n<h1 class=\"ArticleTitle\" lang=\"en\"><u><span style=\"color: #0066cc;\">\u00a0<\/span><\/u><\/h1>\n<\/div>\n<div class=\"authors u-clearfix authors--enhanced\" data-component=\"SpringerLink-Authors\">\n<div id=\"authors\" class=\"authors__list\" data-role=\"AuthorsList\">\n<ul>\n<li><span class=\"authors__name\">Suzan\u00a0Verberne<\/span><\/li>\n<li><span class=\"authors__name\">Maya\u00a0Sappelli<\/span><\/li>\n<li><span class=\"authors__name\">Djoerd\u00a0Hiemstra<\/span><\/li>\n<li><span class=\"authors__name\">Wessel\u00a0Kraaij<\/span><\/li>\n<\/ul>\n<\/div>\n<div id=\"authorsandaffiliations\" class=\"authors__affiliations\">\n<div class=\"authors-affiliations u-interface\">\n<ul>\n<li><span class=\"affiliation__count\">3.<\/span><span class=\"affiliation__item\"><span class=\"affiliation__name\">University of Twente<\/span><span class=\"affiliation__city\">Enschede<\/span><span class=\"affiliation__country\">The Netherlands<\/span><\/span><\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"article-context__container\" data-component=\"SpringerLink-ArticleMetrics\">\n<div class=\"article-context__primary\">\n<p><span id=\"open-choice-icon\" class=\"open-access\"><abbr title=\"This content is freely available to anyone, anywhere, at any time\">Open Access<\/abbr><\/span>Article<\/p>\n<div class=\"article-dates article-dates--enhanced\" 
data-component=\"SpringerLink-ArticleDates\">\n<dl>\n<dt>First Online:<\/dt>\n<dd class=\"article-dates__first-online\"><a class=\"gtm-first-online\" href=\"http:\/\/link.springer.com\/article\/10.1007\/s10791-016-9286-2#article-dates-history\"><u><span style=\"color: #0066cc;\">10 August 2016<\/span><\/u><\/a><\/dd>\n<\/dl>\n<div id=\"article-dates-history\">\n<dl class=\"article-dates__history\">\n<dt>Received:<\/dt>\n<dd><time datetime=\"2016-02-15\">15 February 2016<\/time><\/dd>\n<dt>Accepted:<\/dt>\n<dd><time datetime=\"2016-07-28\">28 July 2016<\/time><\/dd>\n<\/dl>\n<\/div>\n<\/div>\n<p class=\"article-doi\"><abbr title=\"Digital Object Identifier\">DOI<\/abbr>: 10.1007\/s10791-016-9286-2<\/p>\n<\/div>\n<dl class=\"article-cite\">\n<dt>Cite this article as:<\/dt>\n<dd id=\"citethis-text\">Verberne, S., Sappelli, M., Hiemstra, D. et al. Inf Retrieval J (2016) 19: 510. doi:10.1007\/s10791-016-9286-2<\/dd>\n<\/dl>\n<\/div>\n<section id=\"Abs1\" class=\"Abstract\" lang=\"en\" tabindex=\"-1\">\n<h2 class=\"Heading\">Abstract<\/h2>\n<p id=\"Par1\" class=\"Para\">We evaluate five term scoring methods for automatic term extraction on four different types of text collections: personal document collections, 
news articles, scientific articles and medical discharge summaries. Each collection has its own use case: author profiling, Boolean query term suggestion, personalized query suggestion and patient query expansion. The methods for term scoring that have been proposed in the literature were designed with a specific goal in mind. However, it is as yet unclear how these methods perform on collections with characteristics different from those they were designed for, and which method is the most suitable for a given (new) collection. In a series of experiments, we evaluate, compare and analyse the output of six term scoring methods for the collections at hand. We found that the most important factors in the success of a term scoring method are the size of the collection and the importance of multi-word terms in the domain. Larger collections lead to better terms; all methods are hindered by small collection sizes (below 1000 words). The most flexible method for the extraction of single-word and multi-word terms is pointwise Kullback\u2013Leibler divergence for informativeness and phraseness. Overall, we have shown that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than those they were originally designed for.<\/p>\n<\/section>\n","protected":false},"excerpt":{"rendered":"<p>\u00a0 Suzan\u00a0Verberne Maya\u00a0Sappelli Djoerd\u00a0Hiemstra Wessel\u00a0Kraaij 3.University of TwenteEnschedeThe Netherlands Open AccessArticle First Online: 10 August 2016 Received: 15 February 2016 Accepted: 28 July 2016 DOI: 10.1007\/s10791-016-9286-2 Cite this article as: Verberne, S., Sappelli, M., Hiemstra, D. et al. Inf Retrieval J (2016) 19: 510. 
doi:10.1007\/s10791-016-9286-2 8Shares 1.4kDownloads Abstract We evaluate five term scoring methods for [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[],"class_list":["post-203","post","type-post","status-publish","format-standard","hentry","category-publication"],"_links":{"self":[{"href":"https:\/\/liacs.leidenuniv.nl\/~kraaijw\/index.php\/wp-json\/wp\/v2\/posts\/203","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/liacs.leidenuniv.nl\/~kraaijw\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/liacs.leidenuniv.nl\/~kraaijw\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/liacs.leidenuniv.nl\/~kraaijw\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/liacs.leidenuniv.nl\/~kraaijw\/index.php\/wp-json\/wp\/v2\/comments?post=203"}],"version-history":[{"count":3,"href":"https:\/\/liacs.leidenuniv.nl\/~kraaijw\/index.php\/wp-json\/wp\/v2\/posts\/203\/revisions"}],"predecessor-version":[{"id":238,"href":"https:\/\/liacs.leidenuniv.nl\/~kraaijw\/index.php\/wp-json\/wp\/v2\/posts\/203\/revisions\/238"}],"wp:attachment":[{"href":"https:\/\/liacs.leidenuniv.nl\/~kraaijw\/index.php\/wp-json\/wp\/v2\/media?parent=203"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/liacs.leidenuniv.nl\/~kraaijw\/index.php\/wp-json\/wp\/v2\/categories?post=203"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/liacs.leidenuniv.nl\/~kraaijw\/index.php\/wp-json\/wp\/v2\/tags?post=203"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}