Study in Progress

Extracting a Data Schema from Document Templates with Deep Learning Techniques

Throughout decades of software development, data schema design has been a significant concern because the schema forms the main frame of a software system's structure. Valuable data demands a well-designed structure to store it purposefully. Meeting these criteria, however, requires a software design specialist to review the requirements and extract a schema that covers what the program needs. Unfortunately, the pace of the current technological era pushes practitioners in this field to bring products to market quickly, so many software house companies assign developers to both development and design roles. As a result, highly customized software often lacks an effective database schema design. This study therefore focuses on how to automatically transform a prototype into a data schema using machine learning techniques. The first step of the study covers the data preparation process, which includes data cleaning and normalization. After that, the study will find a method to classify the types of data in the dataset. Finally, we will apply an appropriate model to extract the schema from the prepared data.

 

INTRODUCTION

The growing popularity of software development over recent decades has led many companies worldwide to invest in this area. Under market pressure, each company's business must deliver its products with three factors in mind: high quality, lower cost, and minimized time. As is well known, the software house business invests a lot of money in recruiting technology resources; still, under limited income, most companies need to save costs through resource utilization, for example by expecting one senior developer to also have system design skills, combining two tasks in one role. However, the perspectives of system analysts and developers are different: a system analyst (SA) looks at an overview of the system's business and then designs the specification details, while a developer designs partial modules during the development phase. Problems therefore occur when the system runs at a large scale, because some data schemas do not support the actual business requirements. This study aims to develop a method for extracting a data schema from prototype resources, focusing on document formats, using deep learning techniques to solve this problem at low cost and with less time spent.

 

PROPOSED METHOD

  • Data Preparation

Data preparation involves significant activities and tasks, including data profiling, cleansing, integration, and transformation. Through it, data is transformed from a raw, unstructured form into a more valuable, structured form ready for analysis. This can lead to more accurate and reliable results, which can be used to make better decisions in various fields, including business, healthcare, and science. Additionally, data auditing and cleaning help identify and eliminate anomalies in the data, further improving the quality of the results. This study will focus on PDF files and form templates used to store data, review them to find appropriate training data, and prepare those data to be trained in the model that builds the data schema automatically; a minimal sketch of the first conversion step follows.
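
As a starting point, the pages of a PDF template can be rendered to images that the OCR step below can consume. This is a minimal sketch assuming the pdf2image package (with its poppler backend) is installed; the file name is a hypothetical placeholder.

# Render each page of a PDF form template to an image for OCR.
from pdf2image import convert_from_path

pages = convert_from_path("form_template.pdf", dpi=300)  # one PIL image per page
for i, page in enumerate(pages):
    page.save(f"page_{i}.png", "PNG")  # saved pages feed the OCR step below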

  • Data Classification

After preparing the dataset, text detection is required. This process uses optical character recognition (OCR), a groundbreaking technology in today's digital environment and an essential tool for detecting and extracting words from any document resource. We decided to adopt the EasyOCR model for this study. EasyOCR is a Python package with a large OCR library supporting more than 80 languages, and it uses PyTorch as its backend handler. It detects text in images and, in our experience, is the most straightforward way to do so; a usage sketch follows.
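
The sketch below shows how EasyOCR might be applied to one prepared page image; the file name is a hypothetical placeholder from the preparation step above.

# Detect and recognize text on a page image with EasyOCR.
import easyocr

reader = easyocr.Reader(["en"])          # loads the detection and recognition models
results = reader.readtext("page_0.png")  # list of (bounding box, text, confidence)
for bbox, text, confidence in results:
    print(f"{text!r} ({confidence:.2f}) at {bbox}")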

 

EasyOCR combines several deep learning models to generate its output. The first is Character Region Awareness for Text Detection (CRAFT), a deep learning-based model that provides scene text detection, i.e., bounding boxes around text in an image. These bounding boxes can then be passed to an optical character recognition (OCR) engine for enhanced text recognition accuracy. The CRAFT model uses a convolutional neural network to calculate region scores and affinity scores: the region score is used to localize individual character regions, while the affinity score is used to group those characters into text regions.
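
To make the two scores concrete, here is a conceptual sketch of how region and affinity score maps could be turned into text boxes. The thresholds and the connected-components post-processing are illustrative assumptions, not CRAFT's exact procedure.

# Combine CRAFT-style region and affinity score maps into word-level boxes.
import cv2
import numpy as np

def scores_to_boxes(region_score, affinity_score, region_thresh=0.4, affinity_thresh=0.4):
    # Pixels count as text where characters are detected (region) or linked (affinity).
    text_mask = (region_score > region_thresh) | (affinity_score > affinity_thresh)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(
        text_mask.astype(np.uint8), connectivity=4)
    # Each stats row holds x, y, width, height, area; row 0 is the background.
    return [tuple(stats[i, :4]) for i in range(1, n)]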

Another model used is ResNet, a deep residual network. To solve the problem of the vanishing/exploding gradient, this architecture introduced the concept of residual blocks, which rely on a technique called skip connections. A skip connection connects the activations of a layer to later layers by skipping some layers in between; this forms a residual block, and ResNets are made by stacking these residual blocks together.
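
A minimal sketch of one such residual block in PyTorch (the framework EasyOCR is built on) makes the skip connection explicit; the channel count is illustrative.

# A basic residual block: the input is added back to the block's output.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # skip connection lets gradients bypass the layers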

LSTM stands for long short-term memory network, used in the field of deep learning. It is a variant of the recurrent neural network (RNN) capable of learning long-term dependencies, especially in sequence prediction problems. An LSTM has feedback connections, meaning it can process entire sequences of data rather than single data points such as isolated images.
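
The sketch below shows an LSTM consuming a sequence of feature vectors, much as a text recognizer consumes per-column features of a word image; all shapes are illustrative.

# An LSTM over a sequence of 50 feature vectors of size 64.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
features = torch.randn(1, 50, 64)      # (batch, sequence length, feature size)
outputs, (h_n, c_n) = lstm(features)   # outputs: (1, 50, 128), one vector per step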

Connectionist temporal classification (CTC) is a way to get around not knowing the alignment between the input and the output, which makes it especially well suited to applications like speech and handwriting recognition.
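
PyTorch ships a CTC loss, so a minimal sketch of scoring an unaligned label sequence against per-time-step predictions looks as follows; all sizes and label indices are illustrative, and index 0 is reserved for the CTC blank symbol.

# CTC loss between 50 time steps of class log-probabilities and 4 target labels.
import torch
import torch.nn as nn

T, C, batch = 50, 20, 1                  # time steps, classes (incl. blank), batch size
log_probs = torch.randn(T, batch, C).log_softmax(dim=2)
targets = torch.tensor([[3, 7, 7, 12]])  # unaligned label indices for the batch
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([4])

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)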

  • Data Relationship

In terms of data relationships, this step needs to find the connection between data points to identify domains and attributes by applying the Word2Vec method, a natural language processing (NLP) technique. Using a neural network model, the word2vec algorithm learns word associations from a large text corpus. Once trained, such a model can identify synonyms and suggest additional terms for a sentence fragment. As its name suggests, word2vec represents each distinct word as a vector consisting of a specific list of numbers. Because these vectors are chosen to capture the semantic and syntactic properties of words, a basic mathematical function called cosine similarity can measure how semantically similar the words that those vectors represent are.
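
Cosine similarity is simply the dot product of two vectors divided by the product of their lengths, cos(a, b) = a·b / (|a||b|). The toy vectors below are illustrative stand-ins for trained word2vec embeddings.

# Cosine similarity between two (toy) word vectors; values near 1 mean similar.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

customer = np.array([0.8, 0.1, 0.3])
client = np.array([0.7, 0.2, 0.3])
print(cosine_similarity(customer, client))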

Word2vec is a collection of related models used to generate word embeddings. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. Word2vec accepts a large corpus of text as input and generates a vector space, typically with several hundred dimensions, in which each unique word in the corpus is assigned a vector.

Word2vec can produce these distributed representations of words using one of two model architectures: continuous bag-of-words (CBOW) or continuous skip-gram. In both architectures, word2vec iterates over the entire corpus while considering individual words and a sliding window of context words around them. In the continuous bag-of-words architecture, the model predicts the current word from the window of surrounding context words, and the order of the context words does not affect the prediction (the bag-of-words assumption). In the continuous skip-gram architecture, the model predicts the surrounding window of context words from the current word, giving more weight to nearby context words than to distant ones. A training sketch covering both architectures follows.
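
Here is a minimal sketch of training both architectures with the gensim library, where the sg flag selects between CBOW (sg=0) and skip-gram (sg=1); the toy corpus is a hypothetical stand-in for field names extracted from document templates.

# Train word2vec in both CBOW and skip-gram modes on a toy corpus.
from gensim.models import Word2Vec

corpus = [["customer", "name", "address"],
          ["customer", "phone", "email"],
          ["invoice", "date", "amount"]]

cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)
print(cbow.wv.most_similar("customer", topn=2))  # nearest neighbours by cosine similarity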

  • Data Modeling

Data modeling is the process of producing a visual representation of an information system, either in its entirety or in part, to communicate the relationships between data points and structures. The objective is to illustrate the data types used and stored within the system, the relationships among them, how data can be grouped and organized, and their formats and attributes. To create the data schema in this study, we will select a machine learning method appropriate to the study that will help generate the relational schema.

 

FUTURE WORK

The above is only a guideline overview for this study in progress. Following the process, we will start with data preparation and then run the classification process to evaluate which data can possibly form a data schema.
