Library and Information Science, Computer Applications For Educated youths (LIS Cafe): an international education website (India's first and the world's largest website for online study of Library and Information Science and Computer Applications through objective and subjective questions).
Knowledge & Thoughts is free at the LIS Cafe. Just bring your own learning container. - Asheesh Kamal

"Share your Knowledge. It is a Way to Achieve Immortality".---Dalai Lama XIV

Wednesday, August 10, 2016

Document classification: An Overview

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.
The documents to be classified may be texts, images, music, etc. Each kind of document poses its own classification problems. When not otherwise specified, text classification is implied.
Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach.

"Content-based" versus "request-based" classification
Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. For example, a common rule for classification in libraries is that at least 20% of the content of a book should be about the class to which the book is assigned. In automatic classification, the weight could be the number of times given words appear in a document. Request-based classification, by contrast, is classification in which the anticipated requests from users influence how documents are classified.
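The word-count idea can be sketched in a few lines of Python. This is only an illustration, not a description of how any library or product works: the keyword lists, the function names, and the reuse of 20% as a threshold on word share are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical keyword lists per class; invented for illustration only.
CLASS_KEYWORDS = {
    "chemistry": {"acid", "molecule", "reaction"},
    "history": {"empire", "war", "treaty"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def classify_by_content(text, threshold=0.20):
    """Return the classes whose keywords make up at least `threshold` of all tokens."""
    tokens = tokenize(text)
    if not tokens:
        return []
    counts = Counter(tokens)
    result = []
    for label, keywords in CLASS_KEYWORDS.items():
        # Share of the document's words that belong to this class's keywords.
        share = sum(counts[word] for word in keywords) / len(tokens)
        if share >= threshold:
            result.append(label)
    return sorted(result)

doc = "The acid reaction changed the molecule after the war"
print(classify_by_content(doc))
```

Here three of the nine words are chemistry keywords and only one is a history keyword, so only the chemistry class passes the threshold. Because a document may pass the threshold for several classes, the function returns a list, matching the "one or more classes" wording of the task.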

Classification versus indexing
Sometimes a distinction is made between assigning documents to classes ("classification") and assigning subjects to documents ("subject indexing"), but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. "These terminological distinctions," he writes, "are quite meaningless and only serve to cause confusion" (Lancaster, 2003, p. 21). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf. Aitchison, 1986, 2004; Broughton, 2008; Riesthuis & Bliedung, 1991). Therefore, the act of labeling a document (say, by assigning a term from a controlled vocabulary to it) is at the same time the act of assigning that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents).
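The equivalence between indexing and classification can be made concrete with a small sketch: inverting a table of "document to index terms" yields a table of "term to class of documents". The document identifiers, terms, and function name below are hypothetical.

```python
from collections import defaultdict

# Hypothetical subject-indexing records: document id -> assigned index terms.
index_terms = {
    "doc1": ["cataloguing", "metadata"],
    "doc2": ["metadata"],
    "doc3": ["cataloguing"],
}

def classes_from_index(assignments):
    """Invert document -> terms into term -> set of documents (the class)."""
    classes = defaultdict(set)
    for doc_id, terms in assignments.items():
        for term in terms:
            classes[term].add(doc_id)
    return dict(classes)

classes = classes_from_index(index_terms)
print(sorted(classes["metadata"]))
```

Every document indexed with the term "metadata" ends up in the class "metadata", so the two views hold the same information.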

Automatic document classification (ADC)
Automatic document classification tasks can be divided into three kinds: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents; unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information; and semi-supervised document classification, where only some of the documents are labeled by the external mechanism. Several software products are available under various license models.
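The difference between the supervised and unsupervised settings can be sketched as follows. This is a minimal illustration under stated assumptions: documents are reduced to sets of words, similarity is plain word overlap (Jaccard), and the function names, sample texts, and the 0.2 similarity cut-off are invented for the example.

```python
def words(text):
    """Represent a document as the set of its lowercase words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Word-overlap similarity between two word sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def supervised_classify(text, labeled_docs):
    """Supervised: copy the label of the most similar labeled document."""
    target = words(text)
    _, best_label = max(labeled_docs, key=lambda pair: jaccard(target, words(pair[0])))
    return best_label

def unsupervised_cluster(texts, min_similarity=0.2):
    """Unsupervised: greedily group similar documents; no labels are consulted."""
    clusters = []
    for text in texts:
        for cluster in clusters:
            # Compare against the first document of each existing cluster.
            if jaccard(words(text), words(cluster[0])) >= min_similarity:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters

labeled = [
    ("library catalogue records", "library"),
    ("neural network training", "computing"),
]
print(supervised_classify("training a neural network", labeled))

unlabeled = [
    "library catalogue records",
    "catalogue records search",
    "neural network training",
]
print(unsupervised_cluster(unlabeled))
```

The supervised function needs the externally supplied labels to answer at all, while the clustering function only produces unnamed groups. A semi-supervised approach would combine the two, for instance by labeling each cluster from the few labeled documents it contains.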

Automatic document classification techniques include:
Expectation maximization (EM)
Naive Bayes classifier
Instantaneously trained neural networks
Latent semantic indexing
Support vector machines (SVM)
Artificial neural network
K-nearest neighbour algorithms
Decision trees such as ID3 or C4.5
Concept mining
Rough set-based classifier
Soft set-based classifier
Multiple-instance learning
Natural language processing approaches
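
As an illustration of one technique from this list, the following is a minimal sketch of a naive Bayes classifier (the multinomial variant with add-one smoothing), written with the Python standard library only. The class name, the training sentences, and the "spam"/"ham" labels are made up for the example; real systems would normally rely on an established library and far more training data.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior of the class
            for word in doc.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen during training
                count = self.word_counts[label][word]
                # Smoothed log likelihood of the word given the class.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

# Made-up training data for illustration.
train_docs = [
    "win money now",
    "free money offer",
    "meeting agenda attached",
    "project meeting tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("free money"))
print(model.predict("project meeting"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the add-one smoothing keeps a word that is unseen in one class from driving that class's probability to zero.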

Classification techniques have been applied to:
spam filtering, a process which tries to distinguish email spam messages from legitimate emails
email routing, forwarding an email sent to a general address to a specific address or mailbox depending on topic
language identification, automatically determining the language of a text
genre classification, automatically determining the genre of a text
readability assessment, automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger text simplification system
sentiment analysis, determining the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document
article triage, selecting articles that are relevant for manual literature curation, for example as the first step in generating manually curated annotation databases in biology
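
To give a feel for one of these applications, here is a deliberately simple sketch of language identification that counts how many common function words of each language occur in a text. The tiny word lists and the function name are assumptions for the example; practical language identifiers usually rely on character n-gram statistics and much larger models.

```python
# Tiny, hand-picked function-word lists; assumptions for illustration only.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to"},
    "german": {"der", "die", "und", "ist", "das"},
    "french": {"le", "la", "et", "est", "les"},
}

def identify_language(text):
    """Return the language whose function words occur most often, or None."""
    tokens = text.lower().split()
    scores = {
        language: sum(token in stopwords for token in tokens)
        for language, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # If no function word matched at all, refuse to guess.
    return best if scores[best] > 0 else None

print(identify_language("Der Hund ist in das Haus"))
print(identify_language("la maison est grande et belle"))
```

Returning None when nothing matches keeps the sketch from assigning a language to text it has no evidence about.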

Reference: Wikipedia
Further reading
Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.

Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, 2010.

Thanks and Regards
Asheesh Kamal

Please share this information. Creator, author, editor and compiler: Asheesh Kamal