Data Aggregation
Using SQL
- Open the SQL Editor in the desired folder.
- Type the SQL query in the editor with the appropriate aggregation clause or function (e.g., GROUP BY, SUM); a sample query is sketched after this list.
- Once the query is composed, run it.
- Publish the resulting datasheet with a new name.
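The following is a minimal sketch of such an aggregation query; the table and column names (sales, region, amount) are hypothetical and stand in for your own datasheet.

    -- Hypothetical aggregation: total and average amount per region.
    SELECT region,
           SUM(amount) AS total_amount,
           AVG(amount) AS avg_amount,
           COUNT(*)    AS row_count
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC;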
Using Spark Jobs
- Develop the Spark code with both a Spark context and a SQL context. The SQL query is run through the SQL context, and its result is a DataFrame that is written to the output file (see the code sketch below).
- Build a jar file from the Spark job and upload it as a Resource file in a custom configuration (a sample build definition follows this list).
- Choose the main class and the Spark cluster to run on.
- Under “Arguments” in the configuration, enter the input datasheet, if any, and specify a name for the output sheet where the resulting table should be stored.
- Save the configuration.
- Run the Spark job.
- Find the output file created under the specified name and view the resulting table there.
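If the job is built with sbt (an assumption; any build tool that produces a jar will do), a minimal build definition might look like the following, with illustrative names and versions:

    // build.sbt: hypothetical sbt build for packaging the Spark job.
    // Spark is marked "provided" because the cluster supplies it at runtime.
    name := "aggregation-job"
    scalaVersion := "2.12.18"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"

Running sbt package then produces the jar to upload as the Resource file.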
Snippets of the Spark code are sketched below, illustrating the reading of the input file into dfile1, the SQL query written in the “Processing” section of the code, and the writing of the output file.
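The sketch is a minimal reconstruction of that flow, not the platform’s exact code: the CSV format, header options, and argument positions are assumptions, and the query’s table and column names are hypothetical. Newer Spark versions would use a single SparkSession in place of the separate contexts.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object AggregationJob {
      def main(args: Array[String]): Unit = {
        // args(0): input datasheet path; args(1): output sheet name,
        // as supplied under "Arguments" in the configuration.
        val sc = new SparkContext(new SparkConf().setAppName("AggregationJob"))
        val sqlContext = new SQLContext(sc)

        // Input: read the input file into dfile1 and expose it to SQL.
        val dfile1 = sqlContext.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(args(0))
        dfile1.createOrReplaceTempView("input_table")

        // Processing: the SQL aggregation query (hypothetical columns).
        val result = sqlContext.sql(
          """SELECT region, SUM(amount) AS total_amount
            |FROM input_table
            |GROUP BY region""".stripMargin)

        // Output: write the resulting table under the specified name.
        result.write.option("header", "true").csv(args(1))

        sc.stop()
      }
    }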
