2/8/14

ETL Basics

What is ETL?
  • ETL stands for extract, transform and load are a three-stage process in database usage and data warehousing.
  • It enables integration and analysis of the data stored in different databases and various formats.
  • After it is collected from multiple sources the data is reformatted and cleaned for operational needs. Finally, it is loaded into a target database.
  • ETL Tools used to move data from one operational system to another 
  • Comprehensive testing of a data warehouse at every point throughout the ETL process is becoming increasingly important as more data is being collected and used for strategic decision-making. 
  • ETL testing is often initiated as a result of mergers and acquisitions, compliance, regulations and data consolidation.
  • ETL usually determines the success or failure of a Data warehouse because a lot of importance on data and the quality of data that is being analyzed
E - Extraction
  • The process of pulling out data that is required for the Data Warehouse from the source system
  • The purpose of the extraction process is to reach to the source systems and collect the data needed for the transformation.
  • Usually data is consolidated from different source systems that may use a different data organization or format so the extraction must convert the data into a format suitable for transformation processing.
  • The complexity of the extraction process may vary and it depends on the type of source data.
  • The extraction process also includes selection of the data as the source usually contains redundant data.
  • Extraction can be automated since it becomes repetitive once established
T - Transformation
  • Transformation is the process of making data into reformatted and cleaned for operational needs.
  • Transformation usually done after Extraction with the purpose of ensuring ‘clean’ and ‘consistent’ data
  • The transformation stage involves applying a series of rules or functions to the extracted data. 
  • Transforming the data in accordance with the business rules and standards that have been established
  • Example include:  format changes, de-duplication, splitting up fields, replacement of codes, aggregates
  • Transformation includes validation of records and their rejection if they are not acceptable in integration part. 
  • Transformation is required to convert and summarize operational data into a consistent, business oriented format
  • The amount of manipulation needed for transformation process depends on the data.
  • The most common processes involved in transformation are conversion, clearing the duplicates, standardizing, filtering, sorting, translating and looking up or verifying if the data sources are inconsistent.
Reasons for “Dirty” data

 1) Dummy Values
 2) Absence of Data
 3) Duplicate Data
 4) Inappropriate Use of Address Lines
 5) Violation of Business Rules
 6) Non-Unique Identifiers

Clean data is vital for the success of the warehouse.

Cleansing
=> The process of removing errors and inconsistencies from data being imported to a data warehouse
=> Cleansing could involve in multiple stages


L - Loading
  • Loading is the process of transfer data to the warehouse after extracting, Transforming, cleaning and validating 
  • The loading is the last stage of ETL process
  • It loads extracted and transformed data into a target repository. 
  • One of the ways to load ETL data is physically insert each record as a new row into the table of the target warehouse involving SQL insert statement 
  • The loading part is usually a bottleneck of the whole process. 
  • Issues while loading
  • Huge volumes of data to be loaded
  • Small time window available when warehouse can be taken off line (usually nights)
  • Allow system administrators to monitor, cancel, resume and change load rates
  • After failure, While recovery needs restart from where you were and without loss of data integrity
The Need for ETL
  • ETL facilitates Integration of data from various data sources for building a Data Warehouse 
  • Businesses have data in multiple databases with different formats
  • Makes data available in a queriable format
Data Integrity Problem Scenarios
  • Same person, different spellings
Eg. Agarwal,  Agrawal,  Aggarwal etc.
  • Multiple ways to denote company name
Eg. Persistent Systems, PSPL, Persistent Pvt. LTD.
  • Use of different names
Eg. Mumbai, Bombay
  • Different account numbers generated by different applications for the same customer
  • Required fields left blank
  • Invalid product codes collected at point of sale
  • Manual entry leads to mistakes
Characteristics of ETL Tools
  • Provides facility to specify a large number of transformation rules with a GUI
  • Generate programs to transform data
  • Handle multiple data sources
  • Handle data redundancy
  • Generate metadata as output
  • Support data extraction, cleansing, aggregation, reorganization, transformation and load operations
  • Generate and maintain centralized metadata
  • Closely integrated with various RDBMS
  • Filter data, convert codes, calculate derived values, map many source data fields to one target data field
  • Automatic generation of data extract programs
  • High speed loading of target data warehouses
ETL Tool - General Selection criteria
  • Business Vision/Considerations
  • Overall IT strategy/Architecture
  • Over all cost of Ownership
  • Vendor Positioning in the Market
  • Performance
  • In-house Expertise available
  • User friendliness
  • Training requirements to existing users
  • References from other customers







0 comments:

Post a Comment