Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UKS1 4DP (Ziqi.Zhang@sheffield.ac.uk);
(2) Tomas Jasaitis, Vamstar Ltd., London (Tomas.Jasaitis@vamstar.io);
(3) Richard Freeman, Vamstar Ltd., London (Richard.Freeman@vamstar.io);
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Rowida.Alfrjani@sheffield.ac.uk);
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Adam.Funk@sheffield.ac.uk).
Table of Links
- Domain and Task
- Related Work
  3.1. Text mining and NLP research overview
  3.2. Text mining and NLP in industry use
- 4.6. XML parsing, data joining, and risk indices development
- Experiment and Demonstration
- Discussion
  6.1. The ‘industry’ focus of the project
  6.2. Data heterogeneity, multilingual and multi-task nature
Abstract. While text mining and NLP research has been established for decades, there remain gaps in the literature that reports the use of these techniques in building real-world applications. For example, existing studies typically look at single and sometimes simplified tasks, and do not discuss in depth the data heterogeneity and inconsistency that are common in real-world problems, or their implications for method development. Also, little prior work has focused on the healthcare domain. In this work, we describe an industry project that developed text mining and NLP solutions to mine millions of heterogeneous, multilingual procurement documents in the healthcare sector. We extract structured procurement contract data that is used to power a platform for dynamically assessing supplier risks. Our work makes unique contributions in a number of ways. First, we deal with highly heterogeneous, multilingual data and we document our approach to tackling these challenges. This is mainly based on a method that effectively uses domain knowledge and generalises to multiple text mining and NLP tasks and languages. Second, applying this method to mine millions of procurement documents, we develop the first structured procurement contract database that will help facilitate the tendering process. Finally, we discuss lessons learned for practical text mining/NLP development, and make recommendations for future research and practice.
1. Introduction
Big data technologies have significantly impacted different industries in the last decade. The rapid increase in the amount of data has created unprecedented opportunities for cost reduction, better products and services, improved productivity and decision making. However, transforming data into knowledge that brings real ‘business intelligence’ remains a non-trivial, challenging task. To achieve this, data mining has been widely adopted across industries. It refers to a large range of techniques for finding patterns and correlations within data, and using such insights to predict outcomes. While data mining is often applied to structured data (e.g., database records, relational tables), it is widely acknowledged that up to 80% of ‘big data’ is unstructured, with text (e.g., financial reports, product catalogues) representing the majority (Bach et al., 2019).
Extracting knowledge from unstructured textual data belongs to the field of text mining, which overlaps with the field of Natural Language Processing (NLP) that aims to make machines understand human language. These fields encompass a wide range of techniques such as text classification (Kowsari et al., 2019), named entity recognition (Bose et al., 2021), relation extraction (Bassignana and Plank, 2022), and terminology extraction (Dominika, 2021). All these techniques may be used collectively to create structured knowledge bases (Krzywicki et al., 2016). Indeed, text mining and NLP research dates back to as early as the 1960s (Grishman and Sundheim, 1996) and has led to a vibrant community focusing on various tasks over the years. However, studies have shown that progress made in academic literature is not always adopted in the industrial, real-world application context (Chiticariu et al., 2013; Krishna et al., 2016; Suganthan et al., 2015). This is often due to factors such as the difference in the evaluation focus, the timeframe for development, and the need for interpretability and subsequent maintenance.
For application to real-world problems, text mining has a long history in analysing legal texts, such as for similar case matching, event timeline extraction, question answering (Zhong et al., 2020), and judicial decision prediction (Francia et al., 2022). A significant body of work has also been done in mining Web content related to service/product provisions (Kuma et al., 2021), such as analysing social media data for customer relationship management, marketing intelligence, competitor analysis (Köseoğlu et al., 2021) and, more recently, combating misinformation (e.g., fake reviews). Further, text mining has also been used in system requirements extraction and classification, primarily for software engineering (Li et al., 2015; Khan et al., 2020; Tiun et al., 2020), and in quality and project report analysis in the construction industry (Lee et al., 2014; Zhang et al., 2019; Tian et al., 2021). Recent studies by Rabuzin and Modrusan (2019) and Modrusan et al. (2020) identified an emergent need for, but a lack of, application of text mining to procurement document analysis. Work in this area has only taken off in recent years (Rabuzin and Modrusan, 2019; Modrusan et al., 2020; Choi et al., 2021; Fantoni et al., 2021; Haddadi et al., 2021).
Focusing on the application of text mining in real-world problem solving, we identify several gaps in the current literature. First, text mining for procurement is significantly under-represented but deserves increased attention from the research community. Public procurement represents a significant part of a government’s financial budget and is a very complicated process involving the analysis of multiple documents at both the buyer’s and the supplier’s end. Currently, a lack of standardisation of the documentation process within and across national borders, and a lack of structured databases providing easy access to fine-grained supplier and buyer information (e.g., supplier contractual history, service/product offerings, buyer contract criteria), render the procurement process extremely time-consuming and ineffective. Although some studies have looked at this area, they examined very different tasks and some lack clarity on how the end system can support real decision making (e.g., Grandia and Kruyen, 2020; Haddadi et al., 2021).
Second, we note that current studies address limited complexity in building real-world text mining applications. We describe complexity at two levels: 1) a high level of heterogeneity in data, which partially leads to 2) the large range of text mining methods that need to be combined in a holistic solution. Early studies such as Chalkidis et al. (2017) investigated well-curated data sources, while work in other industrial contexts often deals with homogeneous document types (Zhang et al., 2019; Tian et al., 2021). However, as pointed out in Modrusan et al. (2020), procurement documents are highly heterogeneous in both file format and content structure. The lack of standardisation implies a significant degree of data cleansing and, in many cases, adaptation of state-of-the-art methods that are typically developed with well-curated data. This is a non-trivial process but is rarely documented in the existing literature. The complexity in the data also means that the task cannot be achieved using a single method, as is often the case in the literature. Instead, a holistic approach combining multiple methods must be adopted.
Finally, a large number of studies relied on supervised methods (Chalkidis et al., 2017) that require training data, and studied a single language. Training data is expensive to acquire in a business context and is often language-dependent. As a consequence, developing fully supervised methods in multilingual tasks may be infeasible for businesses. However, in many cases, businesses possess certain forms of ‘domain knowledge’, such as the domain-specific vocabularies in Choi et al. (2021). The challenge is how to effectively use such resources across multiple tasks for many languages in a generalisable way. We note a lack of reporting on this particular problem.
This work fills these gaps by documenting an industry project (with Vamstar Ltd.) that developed text mining methods and solutions to mine large-scale, heterogeneous, multilingual procurement documents in the healthcare sector (pharmaceuticals in particular), with the aim of constructing a structured database of supplier contract histories. The database incorporates fine-grained supplier information that allows presenting a ‘supplier risk profile’ in terms of their capacity and credibility in fulfilling contractual terms. It also enables gauging regional supply chain capacities by aggregating individual supplier data. To the best of our knowledge, our work makes contributions to industrial text mining in several ways: 1) it develops the first structured database for healthcare procurement that can facilitate the tendering process; 2) it is the first to document the holistic backend text mining and NLP process involving multiple tasks working on heterogeneous, multilingual datasets, based on a method that effectively uses domain knowledge and can be easily generalised to multiple tasks and languages; 3) it discusses lessons learned from adapting text mining research to developing real-world applications in industrial contexts.
The remainder of this paper is structured as follows. We first (Section 2) introduce the domain and task studied in this work, setting the wider background for the literature review, which is covered in the following section (Section 3). We then introduce our proposed solution (Section 4). Next, in Section 5, we report the evaluation of the components and present the end product. We further discuss lessons learned (Section 6) from this work and conclude with a reflection on the limitations and future work (Section 7).
* indicates corresponding author