In all domains, the volume of unstructured documents is increasing exponentially. iQC, our AI platform, extracts structured data from unstructured documents.

About

The amount of unstructured data in documents and other sources, such as text in social media, has been increasing exponentially in every domain. Capturing key pieces of relevant information from these sources is particularly important to many enterprises and governments, for whom accurate decision making is both vital and time sensitive. Current technologies for capturing relevant information from documents are generally limited to search, sentiment analysis, extraction of entities such as names of people and locations, and similar techniques. Yet most of these technologies do not reach a level of granularity that analysts and other subject matter experts (SMEs) can use for efficient and accurate decision making. For example, a typical natural language processing (NLP) system will extract names of people from documents, but it cannot be customized to find only the names of people relevant to a topic of interest and discard all other names. In addition, adapting these technologies to specific domains or foreign languages typically requires custom software development with machine learning or NLP expertise.

Agile Data Decisions (AgileDD) has developed iQC, an artificial intelligence (AI) platform that SMEs use to customize the capture of relevant data from their documents (PDFs, MS Office files, images) and text at massive scale. It is a language-agnostic platform that lets SMEs carry out this customization through a user-friendly graphical user interface (GUI) without any computer programming. Users give iQC specific examples of the information relevant to them through the GUI; for example, they can label names, locations, graphical patterns, and numbers of interest in the documents. iQC then learns to identify the targets based on the context (surrounding text) in which the labelled information occurs.

In the GUI, users can:

1. Upload a set of initial training documents or segments of text, such as PDF reports, MS Office documents, or text files. iQC starts learning from as few as 20 initial documents.
2. Label data or information of interest in the training documents, for example names of people, graphical patterns, locations, numbers, or titles. Users can also validate results and take other actions to give feedback to the system, and the machine learns from the actions users perform on their documents.
3. Start machine learning training with the press of a button to train models that learn to find relevant information.
4. Submit new documents or text, once the models are trained, to be processed automatically for the capture of data. Documents can be processed on a massive scale on a computing cluster.
5. Review the results of automatic capture and mark specific items as correct or incorrect. iQC is a continuous learning system and keeps learning from this domain expert feedback.

iQC provides several competitive advantages:

1. iQC can take in a wide variety of document types and text streams, and it does not depend on form templates. In fact, models trained on plain text can be applied to documents and vice versa.
2. iQC is based entirely on learning from human domain experts. This makes iQC well suited to difficult customer problems that require intensive customization, and allows iQC to be trained on documents in any language.
3. iQC enables building highly focused models that learn from human expertise to capture data from relevant pieces of text. For example, instead of extracting all the names of people from a document, iQC can learn from textual context to capture only the relevant names, as specified by the SME during the training stage.
4. iQC is an integrated system that captures both graphical and textual data from documents. It learns both types of capture from domain experts in a uniform graphical interface by combining natural language and computer vision methods.
5. iQC can be trained with a GUI and then process documents at massive scale on a cluster.
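
The idea of learning from context, that is, labelling a few examples, training, scoring new text, and feeding corrections back in, can be illustrated with a small sketch. This is not iQC's implementation, which is not described here. It is a toy naive Bayes classifier over the words surrounding a candidate name, and every name in it (`context_features`, `ContextClassifier`, the sample sentences) is hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def context_features(tokens, index, window=3):
    """Lowercased words within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return [word.lower() for word in left + right]


class ContextClassifier:
    """Toy naive Bayes model over context words (illustrative assumption)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.label_counts = Counter()

    def add_example(self, tokens, index, relevant):
        """Record one labelled item; also used for later expert feedback."""
        self.label_counts[relevant] += 1
        self.word_counts[relevant].update(context_features(tokens, index))

    def score(self, tokens, index):
        """Log-odds that tokens[index] is relevant; positive means relevant."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_rel = sum(self.word_counts[True].values())
        total_irr = sum(self.word_counts[False].values())
        smoothing = len(vocab) + 1  # add-one smoothing, one slot for unseen words
        log_odds = math.log(
            (self.label_counts[True] + 1) / (self.label_counts[False] + 1)
        )
        for word in context_features(tokens, index):
            p_rel = (self.word_counts[True][word] + 1) / (total_rel + smoothing)
            p_irr = (self.word_counts[False][word] + 1) / (total_irr + smoothing)
            log_odds += math.log(p_rel / p_irr)
        return log_odds


model = ContextClassifier()

# Labelling step: the expert marks one name as relevant and one as not.
model.add_example("The well was drilled by operator Smith in 1998".split(), 6, True)
model.add_example("This report was typed by secretary Jones on Monday".split(), 6, False)

# Capture step: score names that were never seen during labelling.
operator_score = model.score("A second well drilled by operator Garcia in 2003".split(), 6)
typist_score = model.score("Minutes were typed by secretary Lee on Friday".split(), 5)

print("Garcia relevant:", operator_score > 0)
print("Lee relevant:", typist_score > 0)
```

In this sketch the names Garcia and Lee never appear in the labelled examples; only the surrounding words decide the outcome. Calling `add_example` again with an expert's correction plays the role of the feedback step.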

Key Benefits

Our customers can derive valuable information from their legacy datasets or open-source documents at scale, adding more data points to their data-driven decision processes and reducing the associated risk. iQC not only reduces the cost of manual data extraction but also allows information extraction to be applied at scale on cloud servers. Fewer human resources are needed for data mining, which frees more resources for data analysis.

Applications

- In the oil and gas industry, we process well and seismic documents to extract key information for exploration.
- In the defense domain, we process documents and forms related to flight debriefing.
- In the mining domain, we extract assay tables, geological logs, key drill-hole values, and geophysical surveys to improve subsurface models.
