Performance of Classification Tools on Unstructured Text

  • Janet L. Kourik

    Student thesis: Doctoral Thesis, Doctor of Philosophy

    Abstract

    As digital storage of data continues to grow, it is increasingly difficult to find information on demand, particularly in unstructured text documents. Unstructured documents lack explicit record definitions or other metadata that can facilitate retrieval. Yet digital information is increasingly stored in unstructured documents. Manual or human-assisted indexing of unstructured documents is time-consuming and expensive. Automated retrieval techniques, such as those used by Internet search engines, have a variety of limitations, including depth and breadth of coverage and frequency of update. In addition, many retrieval methods become impractical on large document collections, where the need for improved performance is even greater. Most text indexing and retrieval systems include a component that classifies documents. This research focused on the automated classification of unstructured text. The goal of this research was to investigate a commercial classification tool and evaluate the tool's performance on Reuters-21578, a benchmark categorization collection of unstructured text. The performance of a commercial off-the-shelf (COTS) product, Oracle Text, on the Reuters-21578 collection was evaluated using a variety of measures documented in the classification literature.
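    The measures documented in the classification literature for Reuters-21578 are typically per-category precision, recall, and F1. A minimal sketch of how these are computed from gold and predicted labels (the category names and predictions below are illustrative only, not results from the study):

    ```python
    # Per-category precision, recall, and F1 for a single-label classifier.
    # The labels are illustrative Reuters-21578 topic names; the predictions
    # are made up for demonstration and do not reflect Oracle Text's output.

    def per_class_scores(gold, pred, label):
        """Return (precision, recall, F1) for one category."""
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    gold = ["earn", "acq", "earn", "crude", "earn", "acq"]
    pred = ["earn", "earn", "earn", "crude", "acq", "acq"]

    p, r, f = per_class_scores(gold, pred, "earn")
    ```

    Macro-averaging (mean of per-category scores) and micro-averaging (pooling counts across categories) are the usual ways to summarize these into a single figure for the whole collection.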
    Date of Award: Jan 1 2005
    Original language: English
    Supervisors: Sumitra Mukherjee (Supervisor), Marlyn Kemper Littman (Advisor) & Maxine S. Cohen (Advisor)
