Text analysis, which is the next step in search technology, refers to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text analysis differs from traditional search in that, whereas search requires users to know what they are looking for, text analysis attempts to discover patterns and information that are not known beforehand. This is achieved through techniques such as pattern recognition, natural language processing, and machine learning. By focusing on patterns and characteristics, text analysis can produce better search results and deeper data analysis, thereby providing quick retrieval of information that would otherwise remain hidden.
Text Analysis for High Recall and Optimum Precision
Text analysis is particularly interesting in areas where users must discover new information, such as criminal investigations, legal discovery, and due diligence. Such investigations require 100% recall, meaning that every relevant document in the collection is retrieved, because the users cannot afford to miss any relevant information. In contrast, someone who uses a standard search engine to search the Internet for background information is satisfied with any relevant result, so long as it is reliable. During eDiscovery or due diligence, an attorney or investigator needs to uncover all possible liabilities, not just the obvious ones.
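The trade-off between recall and precision can be made concrete with a small calculation. The following is a minimal sketch in Python, not part of any particular product; the function name `precision_recall` and the document identifiers are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant hits / everything that was retrieved
    recall    = relevant hits / everything that is relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four documents are actually relevant.
relevant = {"doc1", "doc2", "doc3", "doc4"}

# A narrow query returns only relevant hits but misses half of them.
narrow = ["doc1", "doc2"]

# A broad query finds everything relevant, plus four irrelevant hits.
broad = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6", "doc7", "doc8"]

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

In an investigation, the broad query is the acceptable one: nothing relevant is missed, and the irrelevant hits are the price paid, which is why tools for filtering and ranking matter.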
This demand for high recall is balanced by a variety of tools that refine and filter results, rank them by relevance, de-duplicate hits, and find patterns, so that users do not have to sift through huge volumes of irrelevant information. Our technology leverages various mathematical, statistical, linguistic, and pattern-recognition techniques that allow automatic analysis of unstructured information as well as the extraction of high-quality, relevant data.
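One common way to de-duplicate hits is to compare documents by their overlapping word sequences. The sketch below uses word shingles and Jaccard similarity; this is a generic textbook technique chosen for illustration, not necessarily the method used by any specific product, and the function names, the shingle size, and the 0.5 threshold are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Return (id, id, score) for every pair at or above the threshold."""
    ids = sorted(docs)
    sets = {doc_id: shingles(docs[doc_id], k) for doc_id in ids}
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            score = jaccard(sets[left], sets[right])
            if score >= threshold:
                pairs.append((left, right, score))
    return pairs


# Hypothetical documents: "a" and "b" differ only in the last word.
docs = {
    "a": "the quarterly report was sent to the board on friday",
    "b": "the quarterly report was sent to the board on monday",
    "c": "lunch menu for the cafeteria this week",
}

pairs = near_duplicates(docs)
print(pairs)
```

Comparing every pair is quadratic in the number of documents, so this form only suits small result sets; large collections typically need an approximate scheme such as hashing the shingle sets.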
Examples of text-mining and text-analytics capabilities include: automatic summaries, entity and regular expression extraction for more than 200 different entity types (names, job titles, companies, addresses, countries, social security numbers, credit card numbers, dates, payments, bank accounts and many more), event and fact finding, concept extraction, document property extraction, file property extraction, graphical file detection, automatic language recognition (including for OCR of bitmaps that have not yet been OCR-ed), and exact and near-duplicate detection.
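As an illustration of regular expression extraction, the sketch below pulls three of the entity types listed above out of free text. The patterns are deliberately simplified assumptions (US-style social security numbers, ISO-style dates, 16-digit card numbers); a production system would need far more patterns and validation. The names `PATTERNS`, `luhn_ok`, and `extract_entities` are hypothetical.

```python
import re

# Simplified, illustrative patterns; real-world formats vary widely.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def luhn_ok(number):
    """Check the Luhn checksum used by payment card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for pos, digit in enumerate(reversed(digits)):
        if pos % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def extract_entities(text):
    """Return (label, value, offset) tuples sorted by position in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            # Discard digit runs that only look like a card number.
            if label == "credit_card" and not luhn_ok(value):
                continue
            found.append((label, value, match.start()))
    return sorted(found, key=lambda entity: entity[2])


# Made-up sample text; the card number is a widely used test value.
sample = ("Invoice dated 2023-04-15 was paid with card 4111 1111 1111 1111 "
          "by the holder of SSN 123-45-6789.")

for label, value, offset in extract_entities(sample):
    print(label, value, offset)
```

The checksum step shows why pattern matching alone is rarely enough: a second validation pass removes many false positives that would otherwise reduce precision.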
Text Analysis for Multi-Language Collections
Text analysis should support multiple languages, which is critical when investigations go global and incorporate collections of information in various languages. Multi-language analysis must not only reconcile differences in character sets and words; it also makes intensive use of statistics and of the linguistic properties of each language, such as conjugation, grammar, and word senses or meanings.
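A very simple statistical approach to recognizing the language of a text is to count how many of its words appear in short lists of common function words. The sketch below assumes tiny hand-picked word lists for four languages; the lists, the name `guess_language`, and the scoring rule are illustrative assumptions, and real language identifiers usually rely on much larger models such as character n-gram statistics.

```python
import re

# Tiny, hand-picked lists of frequent function words (illustrative only).
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "was"},
    "german": {"der", "die", "und", "das", "ist", "nicht", "mit", "ein"},
    "french": {"le", "la", "et", "les", "des", "est", "une", "dans"},
    "spanish": {"el", "la", "y", "los", "de", "es", "una", "en"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    # Runs of letters only, so digits and punctuation are ignored.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        language: sum(token in words for token in tokens)
        for language, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("The contract was signed in the presence of the lawyer"))
print(guess_language("Der Vertrag ist nicht mit dem Anwalt unterschrieben"))
```

Short or mixed-language texts can easily fool a counter like this, which is one reason multi-language collections call for the heavier statistical and linguistic machinery described above.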
Text analytics is really the search of the future: this is already visible in the popularity of faceted and exploratory search in many enterprise search applications and on many public websites. Without automatic text analysis, it would be far too expensive, and in practice impossible, to manually structure content and add the metadata required for these types of search techniques.