ETL in the Cloud
What is ETL?
ETL (Extract Transform Load) is a data-warehousing term for working, transforming and finally loading data .
In TimelinePI ETL is a feature where a user can upload small or large files or multiple files, zipped or unzipped and once uploaded they can work with the uploaded file as if they would browse a database table. That means the user can run operations on the uploaded file. For instance: joining two columns into one or trimming etc.. After transforming the data the user can choose to load the transformed table into a project or multiple projects. ETL is a feature for advanced and big-data uploads. Advanced, because compared to the regular file upload, we do not run type detection automatically and we do not create timelines from the data, so the user needs to understand the data. User can upload compressed or big data files and then work on the raw data before creating timelines within a project.
How it Works
A repository is an abstract container which is similar to a project. In a repository TimelinePI holds tables and a table is representing one or more data files which the user uploaded. A user is able to have more than one repository and each repository may have multiple tables. The user can switch between repositories and also a repository could have more users just like a project. A repository owner can invite users to collaborate on a repository.
A repository table is an actual database table but every repository table is assigned to a repository only. In the most simple case a user uploads a CSV file into a repository and TimelinePI create a database table with the contents of the CSV file and then it becomes a repository table.
When the user switches to the table view they see a preview about the raw data table. Preview = 1000 rows. On this view the user can define multiple operations. By clicking on the button “Add operation” and can choose an operation from a list. After choosing one, i.e. “trim” user can configure the operation by adding further conditions. When they finish setting up the operation they can opt-in for a preview about the operation.
When the user chooses to see a preview, TimelinePI will run that operation on a temporary data table. The user will still have the original raw data table and see a temporary data table which is only 1000 lines long. The operation previews on this temp table and it resets after every operation preview.
Each operation will define a dynamically built query which will be able to run against the raw data table as well as on the preview temp table. Once these operations have been executed against the raw data table they are irreversible, akin to running a SQL statement on the table.
- Change case – Converts the field value to UPPER or lower case. For strings only.
- Change type – Converts the selected field into specified type.
- Combine timestamp – Creates Timestamp field from separate Date and Time fields.
- Create timestamp – Creates timestamp field from a text field using format expression.
- Date diff – Creates a new field by calculating the difference between two dates.
- Date add – Add or subtract a date part from a date.
- Delete – Delete records based on a criteria or all.
- Delete by timelines – Delete timelines based on a criteria.
- Delete duplicates – Delete duplicated records.
- Derive field – Creates a new field by combining several fields and the fixed text.
- Join – Adds field from another (child) table to parent table.
- Load into project – Loads the timelines to a new or existing project.
- Remove substring – Removes the specified substring from the field value. For strings only.
- Replace substring – Replaces the specified substring with another substring. For strings only.
- Round timestamp – Rounds the timestamp field to the specific units (second, minute, etc.).
- Transpose – Convert single record with multiple selected fields into multiple records.
- Trim – Removes extra spaces on left and right”
As part of the T (transform) in ETL the user can perform various transformations on the raw data table. A common need is to concatenate two columns into one single column. for example concatenating a column which has data for a Date without Time with a column which contains a Time. The resulting column is a DateTime field type which is a required data field by TimelinePI.
Once a user has made some operations on the raw data table they can choose to save the sequence of those operations as a to-do-list a.k.a operation history for later re-use on the same file or others which require the same operations.
Users may upload:
- A single CSV file
- Multiple CSV files zipped (compressed, road-mapped to later support gzip)
- Single CSV file zipped
A repository may have any number of users and a user could be a part of multiple repositories.
User’s rights in a project could be one of the following:
- Data master – can view the data AND upload new data
- Admin – can view the data AND upload new data AND add/remove other users and change their rights (except for Owner)
- Owner – a person who created the project. There’s only one owner for a project at a time. The ownership could be transferred to another user. An owner has all rights of Admin.
One the current owner could see item Owner in the drop-down list. If he sets this value to another user, the panel comes up “Are you sure you want to transfer the ownership to another user? You won’t be able to take it back. Yes/No”. If user says Yes, they then become Admin and the other user becomes Owner.
Any admin or owner could make another user an admin or remove the admin right from them, except for the Owner. An admin changes the rights of other users by clicking on the drop-down lists in the user grid.
Admin could also add create another user to the project by typing their email and clicking Enter. In such case: New user is created with rights “Data Master”.
If the user already has account in TimelinePI, he simple gets email notification about access to a new project.
If user doesn’t have an account, we create new account in Recovery status, then send user an email with the temporary password – exactly the same process as when user recovers the forgotten password – see User Management document.
This is a person who has access to Upload panel and can upload new data.
Here is where the log of all changes to the repository will be recorded. This means all new users, deleted users and the addition and removal of tables from the repository will be documented here.