Text mining

Text mining, text data mining (TDM), or text analytics is the process of deriving high-quality information from text. It involves “the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.” These resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by identifying patterns and trends through methods such as statistical pattern learning. According to Hotho et al. (2005), text mining can be viewed from three different perspectives: information extraction, data mining, and knowledge discovery in databases (KDD).

Text mining usually involves structuring the input text (typically through parsing, adding derived linguistic features, and removing others), inserting the structured data into a database, deriving patterns within the structured data, and evaluating and interpreting the output. ‘High quality’ in text mining generally refers to relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
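The pipeline above (structure the text, store it, derive patterns) can be sketched minimally in plain Python. This is an illustrative toy, not a production system: the documents, tokenizer, and the "shared terms" pattern are all hypothetical choices made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase raw text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# A toy document collection (hypothetical examples).
docs = {
    "d1": "Text mining derives high-quality information from text.",
    "d2": "Patterns and trends are derived from structured text data.",
}

# Structure the input: one term-frequency vector per document.
structured = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

# Derive a simple pattern: terms shared by every document.
shared = set.intersection(*(set(c) for c in structured.values()))
print(sorted(shared))  # → ['from', 'text']
```

Real systems replace each stage with heavier machinery (linguistic parsing, a database or index, statistical pattern learning), but the structure-then-mine shape is the same.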

Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is to turn text into data for analysis, using natural language processing (NLP), various algorithms, and analytical methods. An important phase of this process is the interpretation of the gathered information.

A typical application of text mining is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the extracted information. A document is the basic unit of textual data and can exist in many types of collections.

Text Analytics

See also: List of text mining methods

Text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation. The term is roughly synonymous with text mining; indeed, Ronen Feldman modified a 2000 description of “text mining” in 2004 to describe “text analytics”. The latter term is now used more frequently in business settings while “text mining” is used in some of the earliest application areas, dating to the 1980s, notably life-sciences research and government intelligence.

The term text analytics also describes the application of text analytics to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is often noted that 80 percent of business-relevant information originates in unstructured form, primarily text. These techniques and processes discover and present knowledge—facts, business rules, and relationships—that is otherwise locked in textual form, impenetrable to automated processing.

Text Analysis Processes

Subtasks—components of a larger text-analytics effort—typically include:

  • Dimensionality reduction: A pre-processing technique that reduces the size of the text data, for example by mapping inflected words to a common root form.
  • Information retrieval: Identification of a corpus as a preparatory step, collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content corpus manager, for analysis.
  • Natural language processing: While some text analytics systems rely exclusively on advanced statistical methods, many others apply more extensive natural language processing, such as part-of-speech tagging, syntactic parsing, and other types of linguistic analysis.
  • Named entity recognition: The use of gazetteers or statistical techniques to identify named text features such as people, organizations, place names, stock ticker symbols, and certain abbreviations.
  • Disambiguation: The use of contextual clues to decide whether, for instance, “Ford” refers to a former U.S. president, a vehicle manufacturer, a movie star, a river crossing, or another entity.
  • Recognition of pattern-identified entities: Features such as telephone numbers, email addresses, and quantities (with units) can be discerned via regular expression or other pattern matches.
  • Document clustering: Identification of sets of similar text documents.
  • Coreference resolution: Identification of noun phrases and other terms that refer to the same object.
  • Relationship, fact, and event extraction: Identification of associations among entities and other information in texts.
  • Sentiment analysis: Discerning subjective (as opposed to factual) material and extracting various forms of attitudinal information such as sentiment, opinion, mood, and emotion. Text analytics techniques help analyze sentiment at the entity, concept, or topic level and distinguish opinion holders and objects.
  • Quantitative text analysis: Techniques from the social sciences in which a human judge or a computer extracts semantic or grammatical relationships between words to uncover meaning or stylistic patterns, usually in casual personal texts, for purposes such as psychological profiling.
  • Pre-processing: Tasks such as tokenization, filtering, and stemming.
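Two of the subtasks above, pre-processing (tokenization, filtering, stemming) and recognition of pattern-identified entities, can be illustrated with the standard library alone. The stopword list, the suffix-stripping rule, and the regexes are deliberately crude stand-ins for real resources such as NLTK's stemmers and tokenizers.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "or", "for", "to", "is", "at"}

def preprocess(text):
    """Tokenize, filter stopwords, and apply a crude suffix-stripping stemmer."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    filtered = [t for t in tokens if t not in STOPWORDS]
    stems = []
    for t in filtered:
        # Naive stemming: strip a few common suffixes (illustration only).
        for suffix in ("ing", "ers", "er", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

def find_pattern_entities(text):
    """Recognize pattern-identified entities (emails, phone numbers) via regexes."""
    return {
        "emails": re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text),
        "phones": re.findall(r"\+?\d[\d -]{7,}\d", text),
    }

sample = "Contact the miners at info@example.com or +1 555 0100 9999 for reports."
print(preprocess(sample))
print(find_pattern_entities(sample))
```

The regex route works well for rigidly formatted entities (phone numbers, email addresses, ticker symbols); named entities such as people and organizations generally need gazetteers or statistical models instead.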

Applications

Text mining technology is broadly applied across various government, research, and business sectors. These groups use text mining for records management and searching documents relevant to their daily activities. Legal professionals may use text mining for e-discovery, for example. Governments and military groups utilize text mining for national security and intelligence purposes. Scientific researchers incorporate text mining approaches to organize large sets of text data (i.e., addressing the problem of unstructured data), determine ideas communicated through text (e.g., sentiment analysis in social media), and support scientific discovery in fields such as the life sciences and bioinformatics. In business, applications support competitive intelligence and automated ad placement, among numerous other activities.

Security Applications

Many text mining software packages are marketed for security applications, especially for monitoring and analyzing online plain text sources such as Internet news and blogs for national security purposes. Text mining techniques are also applied in the study of text encryption and decryption.

Biomedical Applications

Main article: Biomedical text mining

Flowchart of an example text mining protocol used in a study of protein-protein complexes (protein docking).

Text mining applications in the biomedical literature include computational approaches to assist with studies in protein docking, protein interactions, and protein-disease associations. Additionally, with large patient textual datasets in the clinical field, datasets of demographic information in population studies, and adverse event reports, text mining can facilitate clinical studies and precision medicine. Algorithms can stratify and index specific clinical events in large patient textual datasets of symptoms, side effects, and comorbidities from electronic health records, event reports, and diagnostic test reports. Online applications like PubGene combine biomedical text mining with network visualization, and GoPubMed serves as a knowledge-based search engine for biomedical texts. These techniques enable the extraction of previously unknown knowledge from unstructured clinical documents.

Software Applications

Text mining methods and software are being developed by major firms like IBM and Microsoft to further automate mining and analysis processes, and by various firms in the search and indexing sector to improve results. In the public sector, significant effort has been directed towards creating software for tracking and monitoring terrorist activities. Popular tools for study purposes include Weka software for beginners, NLTK for Python programmers, and the Gensim library for advanced word embedding-based text representations.

Online Media Applications

Text mining is used by large media companies like the Tribune Company to clarify information and enhance search experiences for readers, which increases site “stickiness” and revenue. Editors benefit from the ability to share, associate, and package news across properties, significantly increasing opportunities to monetize content.

Business and Marketing Applications

Text analytics is widely used in business, particularly in marketing, such as in customer relationship management. It is applied to improve predictive analytics models for customer churn (customer attrition). Text mining is also applied in stock returns prediction.

Sentiment Analysis

Sentiment analysis involves analyzing products such as movies, books, or hotel reviews to estimate how favorable a review is for the product. This analysis may require a labeled dataset or labeling the affectivity of words. Resources for word and concept affectivity have been created for WordNet and ConceptNet. Text-based approaches to affective computing have been used on multiple corpora, such as student evaluations, children’s stories, and news stories.
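A minimal form of the review scoring described above is lexicon-based: sum the polarity of known opinion words and read the sign as overall orientation. The mini-lexicon below is hypothetical; real systems draw on resources like the WordNet and ConceptNet affect lists mentioned above.

```python
# Hypothetical mini-lexicon mapping words to polarity scores.
LEXICON = {"great": 1, "favorable": 1, "excellent": 2, "boring": -1, "terrible": -2}

def sentiment_score(review):
    """Sum the polarity of known words; the sign gives the overall orientation."""
    words = review.lower().split()
    return sum(LEXICON.get(w.strip(".,!?"), 0) for w in words)

print(sentiment_score("An excellent hotel, great staff!"))   # → 3 (favorable)
print(sentiment_score("Terrible plot and a boring ending.")) # → -3 (unfavorable)
```

Lexicon scoring ignores negation, sarcasm, and context, which is why labeled datasets and supervised models are typically used for entity- or topic-level sentiment.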

Scientific Literature Mining and Academic Applications

Text mining is crucial for publishers who hold large databases of information that require indexing for retrieval. This is especially true in scientific disciplines, where highly specific information is often contained within written text. Consequently, initiatives such as Nature’s proposal for an Open Text Mining Interface (OTMI) and the National Institutes of Health’s common Journal Publishing Document Type Definition (DTD) aim to provide semantic cues to machines to answer specific queries within the text without removing publisher barriers to public access.

Academic institutions have also become involved in text mining initiatives.

Methods for Scientific Literature Mining

Computational methods have been developed to assist with information retrieval from scientific literature. Published approaches include methods for searching, determining novelty, and clarifying homonyms among technical reports.

Digital Humanities and Computational Sociology

The automatic analysis of vast textual corpora has created the possibility for scholars to analyze millions of documents in multiple languages with very limited manual intervention. Key enabling technologies include parsing, machine translation, topic categorization, and machine learning.

Narrative network of the 2012 US elections

The automatic parsing of textual corpora has enabled the extraction of actors and their relational networks on a vast scale, transforming textual data into network data. The resulting networks, which can contain thousands of nodes, are then analyzed using tools from network theory to identify key actors, key communities or parties, and general properties such as robustness or structural stability of the overall network, or centrality of certain nodes. This approach automates the method introduced by quantitative narrative analysis, whereby subject-verb-object triplets are identified with pairs of actors linked by an action, or pairs formed by actor-object.
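The triplet-to-network transformation described above can be sketched directly: treat actors as nodes, actions as edges, and score nodes by degree centrality. The triplets here are invented placeholders standing in for the output of a real parser.

```python
from collections import defaultdict

# Hypothetical subject-verb-object triplets extracted from parsed text.
triplets = [
    ("candidate_a", "criticized", "candidate_b"),
    ("candidate_a", "courted", "voters"),
    ("media", "covered", "candidate_a"),
]

# Build an undirected network: actors/objects are nodes, actions are edges.
neighbors = defaultdict(set)
for subj, _verb, obj in triplets:
    neighbors[subj].add(obj)
    neighbors[obj].add(subj)

# Degree centrality: fraction of other nodes each actor is linked to.
n = len(neighbors)
centrality = {node: len(links) / (n - 1) for node, links in neighbors.items()}
print(max(centrality, key=centrality.get))  # → 'candidate_a'
```

At scale the same construction yields networks of thousands of nodes, which are then handed to standard network-theory tooling for community detection and robustness analysis.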

Content analysis has been a traditional part of social sciences and media studies for a long time. The automation of content analysis has allowed a “big data” revolution in that field, with studies in social media and newspaper content that include millions of news items. Gender bias, readability, content similarity, reader preferences, and even mood have been analyzed using text mining methods over millions of documents. The analysis of readability, gender bias, and topic bias was demonstrated in Flaounas et al., showing how different topics have different gender biases and levels of readability. Additionally, the possibility of detecting mood patterns in a vast population by analyzing Twitter content has been demonstrated as well.

Software

Main article: List of text mining software

Text mining software is available from many commercial vendors and open-source projects.

Intellectual Property Law

Situation in Europe

Video by Fix Copyright campaign explaining TDM and its copyright issues in the EU, 2016 [3:51]

Under European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is illegal. In the UK in 2014, on the recommendation of the Hargreaves review, the government amended copyright law to allow text mining as a limitation and exception. It was the second country in the world to do so, following Japan, which introduced a mining-specific exception in 2009. However, due to the restriction of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law does not allow this provision to be overridden by contractual terms and conditions.

The European Commission facilitated stakeholder discussions on text and data mining in 2013, under the title of Licenses for Europe. The focus on licensing as a solution, rather than limitations and exceptions to copyright law, led representatives of universities, researchers, libraries, civil society groups, and open access publishers to leave the stakeholder dialogue in May 2013.

Situation in the United States

US copyright law, and in particular its fair use provisions, makes text mining legal in the United States, as it is in other fair-use countries such as Israel, Taiwan, and South Korea. Because text mining is transformative, meaning it does not supplant the original work, it is viewed as lawful under fair use. For example, as part of the Google Book settlement, the presiding judge ruled that Google’s digitization project of in-copyright books was lawful, in part because of the transformative uses the digitization project displayed, one such use being text and data mining.

Situation in Australia

There is no exception in Australian copyright law for text or data mining within the Copyright Act 1968. The Australian Law Reform Commission has noted that it is unlikely that the “research and study” fair dealing exception would extend to cover text or data mining, as it would be beyond the “reasonable portion” requirement.

Implications

Until recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through the use of a semantic web, text mining can find content based on meaning and context (rather than just by specific words). Additionally, text mining software can build large dossiers of information about specific people and events. For example, large datasets based on data extracted from news reports can facilitate social network analysis or counter-intelligence. In effect, text mining software may act similarly to an intelligence analyst or research librarian, albeit with a more limited scope of analysis. Text mining is also used in some email spam filters to determine the characteristics of messages likely to be advertisements or other unwanted material. Text mining plays an important role in determining financial market sentiment.
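The spam-filtering use mentioned above is classically done with a naive Bayes text classifier: learn per-word likelihoods from labeled messages, then score new messages by a log-likelihood ratio. The tiny training set below is hypothetical, and add-one smoothing keeps unseen words from zeroing out the probabilities.

```python
import math
from collections import Counter

# Tiny hypothetical training set: (message, is_spam).
training = [
    ("buy cheap pills now", True),
    ("limited offer buy now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday perhaps", False),
]

spam_words, ham_words = Counter(), Counter()
for message, is_spam in training:
    (spam_words if is_spam else ham_words).update(message.split())

def spam_score(message):
    """Log-likelihood ratio under a naive Bayes model with add-one smoothing."""
    score = 0.0
    vocab = len(set(spam_words) | set(ham_words))
    for word in message.split():
        p_spam = (spam_words[word] + 1) / (sum(spam_words.values()) + vocab)
        p_ham = (ham_words[word] + 1) / (sum(ham_words.values()) + vocab)
        score += math.log(p_spam / p_ham)
    return score  # positive: more spam-like; negative: more ham-like

print(spam_score("buy pills now") > 0)   # → True (spam-like)
print(spam_score("monday meeting") < 0)  # → True (ham-like)
```

Production filters add many non-textual features (headers, sender reputation), but the word-likelihood core is the text mining component.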
