An introduction to Text Mining for business users

Everything you need to know about text mining and how business users can leverage Vault to mine the textual data in their business.

Charles Brecque

February 13, 2024

Whilst humans have been communicating and storing knowledge in the form of text for over 3000 years, computer systems have only started to parse and understand text in the past century. The initial methods for teaching computers how to read and mine text involved hundreds of handwritten rules describing the inner workings of language. However, the fields of natural language processing (NLP) and text mining have advanced considerably thanks to recent progress in AI, in particular large language models. This article provides an overview of text mining, the basic principles behind it and the key use cases for text mining in business.

What is Text Mining?

Text mining is a branch of AI focused on understanding and analysing text. It can involve entity detection, sentiment analysis or, more recently, question answering with large neural networks.

Text mining has traditionally drawn on linguistics to understand how text is structured and how to distinguish the words which carry meaning from those that don't. More recently, large language models have delivered superior text mining performance, because increases in compute power have allowed the models to become much larger and more capable.

How does Text Mining work?

Machines can only read numbers, which is why the first step in text mining is to convert text to numbers. Each word is assigned a numerical code, and the codes for a piece of text are stored in a vector. Modelling text with vectors means that text mining problems can be treated as traditional machine learning problems.
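As a rough illustration of this first step, the sketch below assigns each distinct word an integer code and then encodes a new sentence as a vector of those codes. The function names and example sentences are made up for illustration, and real systems typically use more sophisticated tokenisation than splitting on spaces.

```python
def build_vocabulary(sentences):
    """Assign each distinct lowercase word an integer code, in order of first appearance."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def encode(sentence, vocabulary):
    """Convert a sentence into a vector (list) of integer codes, skipping unknown words."""
    return [vocabulary[word] for word in sentence.lower().split() if word in vocabulary]


# Hypothetical example sentences
sentences = ["the invoice is overdue", "the contract is signed"]
vocabulary = build_vocabulary(sentences)
encoded = encode("the contract is overdue", vocabulary)

print(vocabulary)  # {'the': 0, 'invoice': 1, 'is': 2, 'overdue': 3, 'contract': 4, 'signed': 5}
print(encoded)     # [0, 4, 2, 3]
```

Once text is in this numerical form, it can be passed to standard machine learning tools in the same way as any other numeric data.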

There are different methods for converting text to numbers. In practice, a word or a bag of words is mapped to a vector of numbers. Techniques such as stemming or lemmatisation, which reduce different forms of a word to a common root, can simplify the conversion from text to vectors. In this step, it is important that the text is mapped in a way that preserves the signals the model needs about its meaning. The appropriate text vectorisation technique will depend on the use case and the nature of the documents you are processing.
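To make the bag-of-words idea concrete, here is a minimal sketch that counts how often each term in a small vocabulary appears in a piece of text after a very crude form of stemming. The suffix list, vocabulary and example text are assumptions chosen for illustration; production systems would normally rely on an established stemmer or lemmatiser.

```python
from collections import Counter

SUFFIXES = ("ing", "ed", "s")  # illustrative only; real stemmers use many more rules


def crude_stem(word):
    """Strip one common suffix so that related word forms share a token."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in the text, after stemming."""
    counts = Counter(crude_stem(word) for word in text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and text
vocabulary = ["pay", "invoice", "contract"]
vector = bag_of_words("Paying invoices invoice paid", vocabulary)

print(vector)  # [1, 2, 0]
```

Note that "paid" is not counted under "pay" here, because simple suffix stripping cannot handle irregular forms. This is the kind of case where lemmatisation, which uses knowledge of the language, tends to do better than stemming.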

What are the main Text Mining use cases in business?

Text mining in business has typically been a challenge because textual data in organisations is often stored and formatted in PDF documents. Moreover, this document data is often of poor quality for text mining purposes because files are corrupted, hard to locate or disconnected from parent documents. As a result, preparing the textual data is often one of the most challenging parts of text mining. For example, an invoice’s formatting can make it difficult to extract the text from the document in a meaningful and systematic way.

Text mining use cases in business have primarily focused on textual data which is either scraped from websites or stored in databases. For example, businesses have monitored the sentiment of consumer tweets in order to gauge how likely those consumers are to spend on their products or services. However, these types of use cases require machine learning and natural language processing expertise which is not typically available within non-technical business units.
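For readers curious about what sentiment monitoring looks like at its very simplest, the sketch below scores messages by counting positive and negative words. The word lists and messages are invented for illustration, and this lexicon-based approach is far cruder than the trained models businesses would typically use in practice.

```python
POSITIVE_WORDS = {"love", "great", "excellent", "happy"}  # hypothetical lexicon
NEGATIVE_WORDS = {"hate", "terrible", "broken", "slow"}   # hypothetical lexicon


def sentiment_score(message):
    """Return the positive-word count minus the negative-word count for one message."""
    words = [word.strip(".,!?").lower() for word in message.split()]
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives


# Hypothetical consumer messages
messages = [
    "I love this product, delivery was great!",
    "Terrible support and the app is slow.",
    "It arrived on Tuesday.",
]
scores = [sentiment_score(message) for message in messages]

print(scores)  # [2, -2, 0]
```

A score above zero suggests a positive message, below zero a negative one, and zero a neutral or unrecognised one.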

How can business users mine the textual data in their documents?

Text mining use cases for business users fall into one of the following categories:

Answering specific questions about a document
Answering aggregate questions about a set of documents
Automatically converting textual data into tabular data which can then be used for reporting or as part of another workflow
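The last category, turning text into a table, can be sketched with a simple rule-based example. The code below pulls three fields out of made-up invoice text using regular expressions and writes the result as CSV. The field names, patterns and documents are all assumptions for illustration; this is not how Vault works, and handwritten patterns like these tend to break as soon as document layouts vary.

```python
import csv
import io
import re

# Hypothetical field patterns for a made-up invoice layout.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s+No\.?\s*:\s*(\S+)", re.IGNORECASE),
    "total": re.compile(r"Total\s*:\s*£?([\d,]+\.\d{2})", re.IGNORECASE),
    "due_date": re.compile(r"Due\s+date\s*:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
}


def extract_row(text):
    """Return one table row (a dict) per document; missing fields are left empty."""
    row = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[field] = match.group(1) if match else ""
    return row


# Hypothetical document text, as it might look after extraction from a PDF
documents = [
    "Invoice No: INV-001\nTotal: £1,250.00\nDue date: 2024-03-01",
    "Invoice No: INV-002\nTotal: £480.50",
]
rows = [extract_row(text) for text in documents]

# Write the rows as CSV so they can feed a report or another workflow
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(FIELD_PATTERNS))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

The second document has no due date, so that cell is left empty rather than guessed, which keeps gaps visible to whoever reviews the table.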

A business user can solve all of these text mining use cases with TextMine’s combination of large language models and knowledge graphs.

The first step is to determine which questions need to be answered or which data points need to be extracted. The next step is to identify which documents contain the relevant information and where they are located within the organisation.

TextMine’s large language model, Vault, is then able to extract the key data points from the documents so that business users can answer questions or leverage the extracted data as part of existing workflows.

A screenshot of a document which has been processed by Vault

Whilst the textual data extraction capabilities are very powerful, users have the option to review the predictions within Vault. Although this step is manual, the user experience is designed to make the review efficient and accurate. Vault is helping solve a number of use cases within procurement, operations and finance teams, ranging from customer onboarding to contract compliance.

Building our Large Language Model

TextMine’s large language model has been trained on thousands of contracts and financial documents, which means that Vault is able to accurately extract key information from your business-critical documents. The model is self-hosted, which means that your data stays within TextMine and is not sent to any third party. Moreover, Vault is flexible, meaning it can process documents it hasn’t previously seen and can respond to custom queries.

About TextMine

TextMine is an easy-to-use data extraction tool for procurement, operations, and finance teams. TextMine encompasses three components: Legislate, Vault and Scribe. We’re on a mission to empower organisations to effortlessly extract data, manage version control, and ensure consistent access across all departments. With our AI-driven platform, teams can effortlessly locate documents, collaborate seamlessly across departments and make the most of their business data.