You can create new Datasets in Amorphic by using the “New Dataset” functionality of the Amorphic application.

To create a new Dataset, you need information such as the Domain, Connection Type, and File Type. The main fields required to create a new Dataset are described below.
Domain
Each Dataset is registered to a Domain. The Dataset domains need to be created in the Amorphic Administrative options.
Connection Type
Select JDBC as the connection type for a JDBC connection, S3 for an S3 connection, and API for all other types of connection.
A JDBC connection type requires you to select a JDBC connection from the list of Amorphic Connections (see the connection section). You also need to specify the name of the table from which the scheduler's data ingestion job will read.

For an S3 connection, you need to specify an S3 connection and the path of the directory that a schedule will poll for new datasets, either on demand or on a time basis (see the schedule section).
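The per-type requirements above can be summarized in a short Python sketch. This is only an illustration of the rules in this section, not Amorphic's API: the `DatasetConnection` class, its field names, and the `missing_fields` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetConnection:
    """Hypothetical container for the connection fields described above."""
    connection_type: str                   # "JDBC", "S3", or "API"
    connection_name: Optional[str] = None  # an existing Amorphic Connection
    table_name: Optional[str] = None       # JDBC only: source table to ingest from
    directory_path: Optional[str] = None   # S3 only: directory polled by the schedule


# Fields each connection type needs, as stated in this section (assumed names).
REQUIRED_FIELDS = {
    "JDBC": ["connection_name", "table_name"],
    "S3": ["connection_name", "directory_path"],
    "API": [],
}


def missing_fields(conn: DatasetConnection) -> List[str]:
    """Return the names of required fields that are still empty."""
    if conn.connection_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown connection type: {conn.connection_type!r}")
    return [name for name in REQUIRED_FIELDS[conn.connection_type]
            if not getattr(conn, name)]
```

For example, `missing_fields(DatasetConnection("JDBC", connection_name="sales_db"))` reports that `table_name` is still needed, while an API connection needs nothing further.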

File Type
The file type should match a file format supported by the ML Model.
In addition to the supported formats, you can extract metadata from unstructured datasets using the auto ML functionality, which is integrated with the AWS Transcribe and Comprehend services in the back end (more details in the ML Model section).
Target Location
This can be S3 or Redshift. Amorphic can ingest the data into a Redshift warehouse. Redshift Datasets require a schema file upload; S3 Datasets do not. You can access the connection details once the Dataset has been created.
Target Location S3 Dataset

Target Location Redshift Dataset

As the figure above shows, Redshift Datasets have a “DWH Connection Details” section. These connection details, together with the “DW Credential” information (see the Group Management section), can be used to connect external business intelligence tools, such as QuickSight, for visualization.
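Many BI tools ask for a JDBC URL when connecting to Redshift. The sketch below builds one in the standard Redshift form `jdbc:redshift://host:port/database`. It assumes that the “DWH Connection Details” expose a host, a port, and a database name; that is an assumption for illustration, and the function name `redshift_jdbc_url` is hypothetical.

```python
def redshift_jdbc_url(host: str, database: str, port: int = 5439) -> str:
    """Build a Redshift JDBC URL; 5439 is the default Redshift port."""
    if not host or not database:
        raise ValueError("host and database are required")
    return f"jdbc:redshift://{host}:{port}/{database}"


# Placeholder values only; copy the real ones from the Dataset's connection details.
url = redshift_jdbc_url("example-cluster.example.com", "analytics")
```

The user name and password are supplied separately in the BI tool, using the “DW Credential” information.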
Keywords
Keywords are used to search the Dataset listing page for a particular Dataset. They act as tags on the Dataset for search purposes.
Table Update Method
You can select Append Data or Latest Record. As a data lake, Amorphic follows an append-only, immutable model: you cannot delete an uploaded file, but you can retrieve only the latest file by selecting the Latest Record table update method.
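The difference between the two methods can be illustrated with a small Python sketch. It models one plausible reading of the description above (all files are kept, and Latest Record only changes which file is returned); the `UploadedFile` type, the `visible_files` function, and the method strings are hypothetical and do not reflect Amorphic's internals.

```python
from datetime import datetime
from typing import List, NamedTuple


class UploadedFile(NamedTuple):
    name: str
    uploaded_at: datetime


def visible_files(files: List[UploadedFile], update_method: str) -> List[UploadedFile]:
    """Return the files a reader would see under each table update method.

    The stored list is never modified, mirroring the append-only model.
    """
    if update_method == "append":
        # Every upload remains visible, oldest first.
        return sorted(files, key=lambda f: f.uploaded_at)
    if update_method == "latest_record":
        # Only the most recent upload is returned; older files still exist.
        return [max(files, key=lambda f: f.uploaded_at)] if files else []
    raise ValueError(f"Unknown table update method: {update_method!r}")
```

With three monthly uploads, `"append"` returns all three in upload order, while `"latest_record"` returns only the newest one.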
Schema File Upload
This functionality provides automated schema extraction for structured data.
Datasets with Redshift as the target location require a sample schema file to be uploaded. The application will automatically try to recognize the schema of the Dataset. This schema is used to create the corresponding tables in the Redshift warehouse. The schema file can be sample CSV data from a training dataset or any other dataset.

Once the schema file is uploaded, you can validate and edit the schema, then publish the new Dataset.
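To give a feel for what automated schema extraction from a sample CSV involves, here is a minimal Python sketch. It is not Amorphic's actual algorithm: the function names are hypothetical, and it only distinguishes three Redshift column types (INTEGER, DOUBLE PRECISION, VARCHAR) by trying to parse every value in a column.

```python
import csv
import io
from typing import Dict, List


def _all_parse(values: List[str], cast) -> bool:
    """True if every value can be converted with `cast`."""
    try:
        for value in values:
            cast(value)
    except ValueError:
        return False
    return True


def infer_column_type(values: List[str]) -> str:
    """Pick the narrowest of three types that fits all non-empty values."""
    non_empty = [v.strip() for v in values if v.strip()]
    if non_empty and _all_parse(non_empty, int):
        return "INTEGER"
    if non_empty and _all_parse(non_empty, float):
        return "DOUBLE PRECISION"
    return "VARCHAR"  # fallback, also used for empty columns


def infer_schema(sample_csv: str) -> Dict[str, str]:
    """Map each header in the sample CSV text to an inferred column type."""
    reader = csv.DictReader(io.StringIO(sample_csv))
    rows = list(reader)
    columns = reader.fieldnames or []
    return {c: infer_column_type([row.get(c) or "" for row in rows])
            for c in columns}
```

A guess like this can be wrong (for example, a ZIP code column would be inferred as INTEGER), which is why the validate-and-edit step before publishing matters.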

Upload File
Once a Dataset is created and its schema is validated (if required), you can upload files to it manually, or on a scheduled basis by using a scheduler that ingests the data through a JDBC connection or an S3 connection.

You can easily transfer the original Datasets, or the result Datasets of each analysis stage, in and out of the platform as your business requirements dictate. This can be done using the Dataset download functionality integrated into the system.