1. What is Amorphic Data ETL?
It’s an extract, transform, and load (ETL) service that automates the time-consuming steps of data preparation for analytics.
2. What can I do with Amorphic Data ETL service?
You can generate ETL code to transform your source data into target schemas and run the ETL jobs in a fully managed Apache Spark environment to load your data into its destination. The service uses AWS Glue on the backend, so the setup and management of the Apache Spark clusters is handled for you.
3. How do I get started with Amorphic Data ETL capabilities?
To get started with Amorphic Data ETL, log in to Amorphic Data and create a new job by providing some AWS Glue related infrastructure information. You can then edit the job's ETL script directly in the Amorphic Data platform.
4. What if I have my own ETL script?
After creating a job, you can bring your own AWS Glue ETL script into the platform using the Upload script functionality. Keep in mind that AWS Glue supports Python and Spark environments.
5. How is the ETL component integrated with the rest of Amorphic Data services?
ETL jobs created in Amorphic Data can be used in two ways: to perform standalone ETL on a Dataset created in the Amorphic Data Dataset service, or as preprocessing and post-processing ETL jobs for running an ML model.
6. How can you run ETL job in a Dataset in Amorphic Data?
Every Dataset in Amorphic Data provides connection information (i.e., its S3 location), which you can use inside the ETL script to run a job against a standalone Dataset, as in the sketch below.
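As an illustration, here is a minimal PySpark sketch that reads from and writes to a Dataset's S3 location. The bucket, prefixes, and CSV format are placeholder assumptions; replace them with the connection information shown on your own Dataset's detail page.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder S3 locations -- copy the real values from the Dataset's
# connection information in Amorphic Data.
input_location = "s3://example-bucket/input-dataset/"
output_location = "s3://example-bucket/output-dataset/"

# Read the standalone Dataset, apply a simple transformation, and write
# the result to the output Dataset's location (CSV is only an assumption).
df = spark.read.csv(input_location, header=True, inferSchema=True)
transformed = df.dropDuplicates()
transformed.write.mode("append").csv(output_location, header=True)
```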
7. How can you use ETL jobs when running ML model in Amorphic Data?
During the model creation process, you specify the preprocessing and post-processing ETL jobs in Amorphic Data. These jobs will then run on the respective Datasets matching the input and output schemas specified during model creation.
8. How do I get started with creating ETL scripts in Amorphic Data?
Every ETL job by default comes loaded with a set of libraries and a Spark context that you can use to develop your ETL script, and you can add more code as your job requires. A typical preamble is sketched below.
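For orientation, the following is a minimal sketch of the kind of AWS Glue preamble such a script typically builds on; the exact libraries and context that Amorphic Data preloads may differ.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard AWS Glue setup: a SparkContext wrapped in a GlueContext,
# plus a Spark session for DataFrame operations.
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Initialize the Glue job from the arguments passed at execution time.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... your ETL logic goes here ...

job.commit()
```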
9. How do I access input dataset and save in output dataset in Amorphic Data ETL service?
By default, every script accepts three arguments:
- originalFileObjectKey: This is useful when running a post-processing job with an ML model and you want to access the name of the original file on which the model was run.
- inputLocation: This is useful when you want to access or set the input location of the Dataset on which the ETL job is run.
- outputLocation: This is useful when you want to access or set the output Dataset location in which the ETL job will save its results.
For ML models, these parameters can be used to identify or set the input and output locations of data in the preprocessing and post-processing stages respectively. A sketch of reading these arguments follows.
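A minimal sketch of picking up these arguments inside the script, assuming they are passed as standard AWS Glue job arguments under the names listed above:

```python
import sys

from awsglue.utils import getResolvedOptions

# Resolve the arguments that the platform passes when the job is executed.
args = getResolvedOptions(
    sys.argv, ["originalFileObjectKey", "inputLocation", "outputLocation"]
)

# The input and output Dataset locations can then drive the rest of the script.
input_location = args["inputLocation"]
output_location = args["outputLocation"]

# For a post-processing job, the name of the original file the model ran on
# is also available.
original_file = args["originalFileObjectKey"]
print(f"Processing {original_file}: {input_location} -> {output_location}")
```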
10. Can I call the input datasets from AWS Glue Catalog?
Yes. Since the Datasets are also registered in the AWS Glue Data Catalog, you can create Spark DataFrames from the Glue Data Catalog tables inside your ETL scripts, as in the sketch below.
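For example, a minimal sketch using the Glue API inside the script; the database and table names here are placeholders for the Catalog entries that correspond to your Datasets.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the Dataset's table from the AWS Glue Data Catalog as a DynamicFrame.
# "example_database" and "example_table" are placeholder names.
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="example_database",
    table_name="example_table",
)

# Convert to a Spark DataFrame for standard Spark transformations.
df = dynamic_frame.toDF()
df.printSchema()
```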
11. What are additional functionalities provided by the Amorphic Data ETL service?
Apart from the sample code included in each job, the ETL service provides a code generation feature that produces a code snippet corresponding to the transformation you select.
12. How do I provide ETL job run related information?
Open the ETL job and click Job Actions > Edit Job Detail to enter AWS Glue execution related information such as Allocated Capacity. Since the service runs on AWS Glue, refer to the AWS Glue documentation for details on job tuning.
13. How do I execute an ETL job?
On the ETL job page, there is an execution button (to the left of “Job Actions”). Click it, provide the job-related information, and click Submit. You will be redirected to the Execution page, which tracks the status of your ETL application.
14. Why is my Execution status for ETL job run not refreshing?
The execution status does not update automatically; click refresh on an individual job run to see its current status.
15. What happens after an ETL job is run successfully?
On a successful ETL job run, the result data is ingested into the Amorphic Data Dataset corresponding to the script's output location. In the backend, the Dataset's Redshift tables are appended with the new data.