Advanced Data Transformations
Owing to the normalization of relational data, information is often spread across several tables. Various operations can be performed on these tables to retrieve the relevant data in the desired form. The following section illustrates this with the example of a “Left Join”, performed in different ways, to extract SalesRepresentative-related data.
CSV
- Navigate to the SQL Editor using the ‘+’ Symbol in the Docube interface.
- Select a CSV file (“SalesRepresentative” in our case) and choose another CSV file from the panel on the left.
- Click the “more” option to preview and join the other CSV file. Upon selecting “Join”, provide the type of join (Left Join) and the columns to be matched (Product).
- Preview the new table and publish it with a new name.
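Conceptually, the Left Join above keeps every row of SalesRepresentative and attaches the matching columns from the second file; rows with no match get empty values. The following is a minimal sketch in plain Python — the column names and sample rows are illustrative assumptions, not taken from the actual files:

```python
# Minimal left-join sketch: every SalesRepresentative row is kept;
# Product values with no match get None for the joined columns.
# Sample data is hypothetical -- the real files define their own columns.
sales_reps = [
    {"Representative": "Alice", "Product": "Widget"},
    {"Representative": "Bob", "Product": "Gadget"},
    {"Representative": "Carol", "Product": "Sprocket"},
]
products = [
    {"Product": "Widget", "Category": "Hardware"},
    {"Product": "Gadget", "Category": "Electronics"},
]

# Index the right-hand table by the join column for O(1) lookups.
by_product = {row["Product"]: row for row in products}

# Left join: keep all left rows, pull in the matching Category (or None).
joined = [
    {**rep, "Category": by_product.get(rep["Product"], {}).get("Category")}
    for rep in sales_reps
]

for row in joined:
    print(row)
```

Note that “Carol” survives the join even though “Sprocket” has no match — that is what distinguishes a Left Join from an inner join.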
SQL
- Navigate to the SQL Editor in the desired folder.
- Type the SQL query in the editor.
- Click on the play button to run the query.
- Publish the datasheet with a suitable name.
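The query for the same Left Join might look like the one below; the table and column names (SalesRepresentative, Products, Product, Category) are assumptions for illustration. The sketch runs the query against an in-memory SQLite database so it is self-contained, but the same SQL can be typed directly into the editor:

```python
import sqlite3

# Hypothetical tables mirroring the CSV example; the real datasheets
# define their own schemas.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SalesRepresentative (Representative TEXT, Product TEXT);
    CREATE TABLE Products (Product TEXT, Category TEXT);
    INSERT INTO SalesRepresentative VALUES ('Alice', 'Widget'), ('Bob', 'Gadget');
    INSERT INTO Products VALUES ('Widget', 'Hardware');
""")

# The Left Join keeps all SalesRepresentative rows; columns from
# Products come back as NULL where there is no match.
rows = conn.execute("""
    SELECT s.Representative, s.Product, p.Category
    FROM SalesRepresentative s
    LEFT JOIN Products p ON s.Product = p.Product
""").fetchall()

for row in rows:
    print(row)
```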
Using Spark Jobs
- Write and build the Spark code, making sure to declare the Spark and SQL contexts. To code the transformation, write the SQL query in the SQL context; the result can then be written as a dataframe to the output file. (A sample Spark program, SampleProgram, is provided for reference.)
- To run a Spark job, a custom configuration must be created. In this configuration, upload the jar file as a resource file, and choose the appropriate class and Spark cluster.
- In the “Arguments” tab of the configuration, enter the input datasheet, if any, and specify a name for the output datasheet where the resulting table should reside.
- Save the configuration.
- Click on the Play button to run the Spark job. The Spark console can be used to monitor the status of the job and view verbose error messages if required.
- View the output table in the specified output file.
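The steps above describe a jar-based (Scala/Java) job, for which the provided SampleProgram is the reference. As a rough language-neutral sketch, the same transformation expressed in PySpark could look like the following — the datasheet names, file paths, and schema are all assumptions for illustration, and the script needs a Spark installation or cluster to run:

```python
# Hypothetical PySpark analogue of the jar-based Spark job described above.
# Paths, table names, and columns are illustrative assumptions.
import sys

from pyspark.sql import SparkSession


def main(input_path: str, output_path: str) -> None:
    # The SparkSession provides both the Spark and SQL contexts.
    spark = SparkSession.builder.appName("LeftJoinJob").getOrCreate()

    # Register the input datasheets as temporary SQL views.
    spark.read.csv(input_path + "/SalesRepresentative.csv", header=True) \
        .createOrReplaceTempView("SalesRepresentative")
    spark.read.csv(input_path + "/Products.csv", header=True) \
        .createOrReplaceTempView("Products")

    # The transformation is written as SQL in the SQL context;
    # the result comes back as a dataframe.
    result = spark.sql("""
        SELECT s.*, p.Category
        FROM SalesRepresentative s
        LEFT JOIN Products p ON s.Product = p.Product
    """)

    # Write the dataframe to the configured output location.
    result.write.mode("overwrite").csv(output_path, header=True)
    spark.stop()


if __name__ == "__main__":
    # The input and output locations correspond to the values entered
    # in the "Arguments" tab of the job configuration.
    main(sys.argv[1], sys.argv[2])
```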
The following video demonstrates adding a new table through Spark jobs.