Semi Structured Data Processing

Creating Model

In order to create models that can be used for predictions, we need labeled historical data.

Prerequisites for this phase are as follows:

Historical documents (PDF / image files)
The ML algorithm needs a substantial volume of historical documents. At least 6 to 12 months of documents are required.

ERP Extract
An ERP Extract that corresponds to the historical documents is required for extracting / labeling the fields that we want to predict. This file must contain every field we want to predict. Make sure that the most unique field is placed as the first column of this dataset.
Eg:- If we are trying to predict Po No, Invoice No and Invoice Date, Invoice No is the most unique field, so it should be the first column in the file.
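
The column-ordering rule above can be checked before upload. The following is a minimal sketch, assuming the ERP Extract is available as a CSV file; the column names, sample values and the helper most_unique_first are hypothetical and only illustrate the idea of moving the column with the most distinct values to the front.

```python
import csv
import io


def most_unique_first(rows, fieldnames):
    """Return fieldnames reordered so the column with the most distinct values is first."""
    distinct = {name: len({row[name] for row in rows}) for name in fieldnames}
    # max() returns the first maximal item, so ties keep the original column order.
    best = max(fieldnames, key=lambda name: distinct[name])
    return [best] + [name for name in fieldnames if name != best]


# Hypothetical ERP extract; column names and values are illustrative only.
sample = io.StringIO(
    "PoNo,InvoiceNo,InvoiceDate\n"
    "PO-1,INV-001,2019-01-02\n"
    "PO-1,INV-002,2019-01-02\n"
    "PO-2,INV-003,2019-01-05\n"
)
reader = csv.DictReader(sample)
rows = list(reader)
ordered = most_unique_first(rows, reader.fieldnames)

# Write the extract back out with the reordered columns.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=ordered)
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Counting distinct values is only a heuristic; if you already know which field identifies a document (typically Invoice No), place it first by hand.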

BELOW ARE THE STEPS TO BE FOLLOWED TO CREATE A MODEL

Step 1:- Run PDF to CSV Convertor in JIFFY

This step converts all the PDF/image files into a single CSV file and uploads it to Docube. Refer to the PDF to CSV Convertor task in JIFFY and update its input and output parameters. The folder from which the files have to be read must be updated at … The Docube location to which the CSV has to be uploaded must be updated at …

Step 2:- Upload ERP Extract on Docube

If you receive the ERP data in multiple files, upload all of them and then create a SQL datasheet that combines them into the ERP Extract.
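
As an illustration of combining several uploaded files into one extract, here is a minimal sketch. It uses SQLite purely as a stand-in, since the SQL dialect of Docube datasheets is not described here; the table names erp_part1 and erp_part2 and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables standing in for two uploaded ERP extract files.
for table in ("erp_part1", "erp_part2"):
    conn.execute(f"CREATE TABLE {table} (InvoiceNo TEXT, PoNo TEXT, InvoiceDate TEXT)")
conn.executemany("INSERT INTO erp_part1 VALUES (?, ?, ?)", [
    ("INV-001", "PO-1", "2019-01-02"),
    ("INV-002", "PO-1", "2019-01-03"),
])
conn.executemany("INSERT INTO erp_part2 VALUES (?, ?, ?)", [
    ("INV-002", "PO-1", "2019-01-03"),  # same row also appears in part 1
    ("INV-003", "PO-2", "2019-01-05"),
])

# UNION (rather than UNION ALL) drops rows that are identical across files.
combined = conn.execute(
    "SELECT InvoiceNo, PoNo, InvoiceDate FROM erp_part1 "
    "UNION "
    "SELECT InvoiceNo, PoNo, InvoiceDate FROM erp_part2 "
    "ORDER BY InvoiceNo"
).fetchall()
print(combined)
```

Use UNION ALL instead if the files never overlap and you want to keep every row as delivered.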

Step 3:- Run ExtractData configuration

You can either update the existing ExtractData configuration or create a new generic Spark configuration using the ExtractData class. If you are using the existing configuration, make sure that the input and output parameters are updated. You may need to update the regular expressions and tags in the RegExpCustom and RegExpGeneral files (in /TrainingCognitive/Data) based on the fields that you are predicting. Tags in RegExpGeneral and RegExpCustom must match the columns in the ERP Extract datasheet.
The ExtractData class extracts the relevant text from each of your PDF/image documents for all the fields in the ERP Extract and converts it to a standard form.
Eg:-
If the Invoice Date is Jan 02 2019 OR 02nd Jan 19 OR January 02 2019, the extracted date will be 02-January-2019 00:00:00. The column name in the ExtractedData datasheet will be finalInvoiceDate.
If the InvoiceAmount is 5.000,00 OR 5,000.00, the extracted amount will be 5000. The column name in the ExtractedData datasheet will be extInvoiceAmount.
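
The kind of normalization described in these examples can be sketched as follows. This is not the ExtractData implementation, which is not shown here; normalize_date and normalize_amount are hypothetical helpers, the list of date layouts covers only the three variants mentioned above, and English month names (the default C locale) are assumed.

```python
import re
from datetime import datetime
from decimal import Decimal

# Input layouts taken from the examples above; real documents may need more.
DATE_FORMATS = ("%b %d %Y", "%B %d %Y", "%d %b %y")


def normalize_date(text):
    """Return the date as DD-Month-YYYY HH:MM:SS, or None if no layout matches."""
    # Drop ordinal suffixes such as "02nd" -> "02".
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip(), flags=re.IGNORECASE)
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
        return parsed.strftime("%d-%B-%Y %H:%M:%S")
    return None


def normalize_amount(text):
    """Parse an amount written with either '.' or ',' as the decimal separator."""
    cleaned = text.strip()
    # Assumption: whichever separator appears last is the decimal separator.
    # An amount with a single separator (e.g. "5,000") is ambiguous and is
    # treated here as having a decimal part.
    if cleaned.rfind(".") > cleaned.rfind(","):
        decimal_sep, thousands_sep = ".", ","
    else:
        decimal_sep, thousands_sep = ",", "."
    cleaned = cleaned.replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


for raw in ("Jan 02 2019", "02nd Jan 19", "January 02 2019"):
    print(raw, "->", normalize_date(raw))
for raw in ("5.000,00", "5,000.00"):
    print(raw, "->", normalize_amount(raw))
```

Decimal is used instead of float so that monetary values are not affected by binary rounding.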

Step 4:- Create/Run LabelData configuration

You can either update the existing LabelData configuration or create a new generic Spark configuration using the LabelData class. If you are using the existing configuration, make sure that the input and output parameters are updated.
The LabelData class labels all the relevant text in your documents for every field in the ERP Extract. Manual labeling is not required; the LabelData class does it for you.
Refer to the DEBUG folder for log information.
Example:- Running the /TrainindData/ModelCreation/LabelData configuration will create log files in /Debug/TrainindData/ModelCreation/LabelData.
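
To illustrate how labels can be derived from the ERP Extract without manual work, here is a minimal sketch. It is not the LabelData implementation, which is not shown here: the exact-match rule, the OTHER label, the helper label_tokens and the sample data are all assumptions made for illustration.

```python
def label_tokens(tokens, erp_row):
    """Label each token with the ERP field whose value it equals, else 'OTHER'."""
    # Map each known ERP value (case-insensitive) back to its field name.
    value_to_field = {
        str(value).strip().lower(): field for field, value in erp_row.items()
    }
    return [
        (token, value_to_field.get(token.strip().lower(), "OTHER"))
        for token in tokens
    ]


# Hypothetical extracted text of one document and its matching ERP row.
tokens = ["Invoice", "INV-001", "PO", "PO-1", "Total", "5000"]
erp_row = {"InvoiceNo": "INV-001", "PoNo": "PO-1", "InvoiceAmount": "5000"}

labels = label_tokens(tokens, erp_row)
for token, label in labels:
    print(f"{token:10s} {label}")
```

Exact matching only works once both sides have been brought to the same standard form, which is why extraction and normalization come before labeling.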

Step 5:- Add additional features (if required)

In case you need more features, add them in the SQL Editor by joining other datasheets.
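
A join of this kind might look like the sketch below. SQLite is used only as a stand-in for the SQL Editor, whose dialect is not described here; the datasheet names labeled_data and vendor_master, the VendorCountry feature and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical datasheets: the labeled data and a vendor lookup to join in.
conn.execute("CREATE TABLE labeled_data (InvoiceNo TEXT, VendorId TEXT, Label TEXT)")
conn.execute("CREATE TABLE vendor_master (VendorId TEXT, VendorCountry TEXT)")
conn.executemany("INSERT INTO labeled_data VALUES (?, ?, ?)", [
    ("INV-001", "V1", "InvoiceNo"),
    ("INV-002", "V2", "InvoiceNo"),
    ("INV-003", "V9", "InvoiceNo"),  # vendor missing from the lookup
])
conn.executemany("INSERT INTO vendor_master VALUES (?, ?)", [
    ("V1", "DE"),
    ("V2", "US"),
])

# LEFT JOIN keeps every labeled row even when no vendor record exists,
# so adding a feature never silently drops training data.
features = conn.execute(
    "SELECT l.InvoiceNo, l.Label, v.VendorCountry "
    "FROM labeled_data AS l "
    "LEFT JOIN vendor_master AS v ON v.VendorId = l.VendorId "
    "ORDER BY l.InvoiceNo"
).fetchall()
print(features)
```

Rows without a match get an empty (NULL) feature value, which you may want to fill or filter before creating the model.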

Step 6:- Create/Run CreateModel configuration

You can either update the existing CreateModel configuration or create a new CreateModel Spark configuration. If you are using the existing configuration, make sure that the input and output parameters are updated.
Refer to the DEBUG folder for log information.
