Harnessing the Power of Knowledge Graphs and Large Language Models for Document Data Extraction

How to determine which of your business departments will maximise efficiency by tracking contract deadlines.

February 6, 2024

It is exceedingly difficult to locate information and track it across various spreadsheets. A small error can introduce incorrect data, which may cost deals or clients and damage relationships. As the volume of data grows, the speed at which you can access it diminishes. To make the right decisions, businesses need accurate and reliable data.

In this article, we look at how procurement, operations and finance teams can leverage document data extraction whilst managing security risks and ensuring critical information isn’t disclosed to the wrong people. Automating this process is one of the most effective ways to improve transparency.

The Problem With Data

Gartner reports that poor data quality costs organisations an average of $12.9 million each year. In addition, research by IBM found that businesses lose up to $3.1 trillion every year as a result of data issues.

IDC anticipates that the amount of data created, captured, copied and consumed in the world will continue to climb at breakneck speed. It estimates that the amount of data created in the next three years will be more than the amount created over the last 30 years, and that over the next five years the world will create more than three times the data it created in the previous five.

What is the impact of bad data? According to Gartner, “Organisations believe poor data quality to be responsible for an average of $15 million per year in losses”. Gartner also found that 60% of those surveyed didn’t know how much bad data costs their businesses because they didn’t measure it in the first place.

Document data remains the most difficult data to track and manage due to the unstructured nature of text. To mitigate bad data proactively, businesses need modern data management tools that offer visibility into the entire data lifecycle.

Visibility with Vault

With our AI-driven platform, Vault, teams can effortlessly locate documents and the data they contain across departments, making the most of their business data. Users benefit from a personalised knowledge repository, enabling them to find answers more swiftly and efficiently than ever before.

Vault is a data extraction solution that uses cutting-edge large language models and knowledge graph technology to structure the unstructured data in your documents. Users can import PDF documents from their desktop or through one of the integrated document management solutions, such as Google Drive, Dropbox, SharePoint or OneDrive.
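
To make the idea of structuring unstructured data more concrete, here is a minimal sketch of how fields extracted from a contract could be stored as knowledge graph triples and then queried for upcoming deadlines. It is an illustration only and does not describe Vault's internals: the `Triple` type, the predicate names and the sample documents are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Triple:
    """One edge of a toy knowledge graph: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: object


# Hypothetical facts, shaped as an extraction step might emit them for two PDFs.
GRAPH = [
    Triple("nda_acme.pdf", "document_type", "NDA"),
    Triple("nda_acme.pdf", "counterparty", "Acme Ltd"),
    Triple("nda_acme.pdf", "expiry_date", date(2024, 3, 31)),
    Triple("employment_jones.pdf", "document_type", "Employment"),
    Triple("employment_jones.pdf", "expiry_date", date(2025, 1, 15)),
]


def expiring_before(graph, cutoff):
    """Return (document, expiry date) pairs that expire before the cutoff, soonest first."""
    hits = [
        (t.subject, t.obj)
        for t in graph
        if t.predicate == "expiry_date" and t.obj < cutoff
    ]
    return sorted(hits, key=lambda pair: pair[1])


def describe(graph, subject):
    """Collect every known fact about one document into a plain dict."""
    return {t.predicate: t.obj for t in graph if t.subject == subject}


if __name__ == "__main__":
    for doc, expiry in expiring_before(GRAPH, date(2024, 6, 30)):
        print(doc, expiry.isoformat(), describe(GRAPH, doc)["document_type"])
```

Once facts are held in this shape, a question such as "which contracts expire this quarter?" becomes a simple filter over the graph rather than a search through spreadsheets.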

Vault uses AI to categorise the documents you upload, extract the key terms within them and derive the implicit insights they contain. To begin, select the PDFs you wish to analyse. You can watch Vault in action as the documents upload.

A screenshot of Vault processing documents using AI

During the upload, Vault determines the type of document you have uploaded, extracts the key terms and answers valuable questions about the document. The progress bar is marked as complete once the model has finished processing. The screenshot above shows that the previous uploads have been identified as NDA, Employment and Recruitment.

The model produces a confidence score, which indicates how confident it is in its findings. However, the score is not an indicator of the accuracy of the results. The score can be improved through manual review. To review Vault’s analysis of the document, select the pencil icon.

A screenshot of a document which has been processed by Vault

The Vault predictions are displayed on the right and can be reviewed manually. Corrections can be made directly from the card. Alternatively, text can be highlighted in the document and marked as a manually created field.
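
The review workflow described above, where predictions carry a confidence score and can be corrected by hand, can be sketched roughly as follows. This is a hypothetical illustration and not Vault's code: the `ExtractedField` type, the `needs_review` and `apply_review` helpers and the 0.8 threshold are assumptions chosen for the example. How Vault itself updates a score after review is not described here, so the sketch simply marks a field as reviewed.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    """A single predicted field, with the model's self-reported confidence."""
    name: str
    value: str
    confidence: float      # assumed to lie in [0, 1]; not a measure of accuracy
    reviewed: bool = False


def needs_review(fields, threshold=0.8):
    """Return unreviewed fields scoring below the threshold, least confident first."""
    pending = [f for f in fields if not f.reviewed and f.confidence < threshold]
    return sorted(pending, key=lambda f: f.confidence)


def apply_review(field, corrected_value=None):
    """Record a human decision: keep the predicted value or overwrite it."""
    if corrected_value is not None:
        field.value = corrected_value
    field.reviewed = True
    return field


if __name__ == "__main__":
    fields = [
        ExtractedField("term", "12 months", 0.95),
        ExtractedField("counterparty", "Acme", 0.40),
        ExtractedField("start_date", "1 March 2024", 0.60),
    ]
    for f in needs_review(fields):
        print(f.name, f.value, f.confidence)
```

Because confidence is not the same as accuracy, a high-scoring field can still be wrong. The threshold only prioritises the reviewer's attention and does not replace spot checks of confident predictions.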

A preview of a document which has been uploaded into Vault

Building our Large Language Model

TextMine’s large language model has been trained on thousands of contracts and financial documents, which means that Vault is able to accurately extract key information from your business-critical documents. The model is self-hosted, so your data stays within TextMine and is not sent to any third party. Moreover, Vault is flexible: it can process documents it hasn’t previously seen and respond to custom queries.

Conclusion

TextMine is an easy-to-use data extraction tool for procurement, operations and finance teams. It encompasses three components: Legislate, Vault and Scribe. We’re on a mission to empower organisations to effortlessly extract data, manage version control and ensure consistent access across all departments. With our AI-driven platform, teams can effortlessly locate documents, collaborate seamlessly across departments and make the most of their business data.
