ETL Process
- Access data dictionaries defining source files
- Identify sources of data from existing systems
- Build logical and physical data models for target data
- Specify business and technical rules for data extraction, conversion and transformation
- Perform data extraction and transformation and load target databases
Push vs. Pull
Pull :- A Pull strategy is initiated by the Target System. As a part of the Extraction Process, the source data can be pulled from Transactional system into a staging area by establishing a connection to the relational/flat/ODBC sources.
Advantage :- No additional space required to store the data that needs to be loaded into to the staging database
Disadvantage :- Burden on the Transactional systems when we want to load data into the staging database
Push :- A Push strategy is initiated by the Source System. As a part of the Extraction Process, the source data can be pushed/exported or dumped onto a file location from where it can loaded into a staging area.
Advantage :- No additional burden on the Transactional systems when we want to load data into the staging database
Disadvantage :- Additional space required to store the data that needs to be loaded into to the staging database
ETL Process Flow
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Record-level:
- Selection – data partitioning
- Joining – data combining
- Aggregation – data summarization
- Single-field – from one field to one field
- Multi-field – from many fields to one, or one field to many
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse