11/23/16

ETL Testing – data warehouse testing questions and Answers

What is a data warehouse?
A data warehouse is a database which,
1.Maintains history of data
2.Contains Integrated data (data from multiple business lines)
3.Contains Heterogeneous data (data from different source formats)
4.Contains Aggregated data
5.Allows only select to restrict data manipulation
6.Data will be stored in de-normalized 
format

Definition of a data warehouse:
1. Subject-oriented
2. Integrated
3. Non-volatile
4. Time-Variant
Main Usage of a data warehouse:
1. Data Analysis
2. Decision Makings
3. Planning or Forecasting
What is a dimension?
A Dimension table is a table where it contains only non-quantifying data and category of information which are key for analysis. A dimension table contains primary key and non-quantifying columns. If the primary key does not exist in source table then surrogate key would exist.
What are the types of dimension?
Based on what type of data it stores there is two major types dimension table,
1.Confirmed dimension
2.Junk dimension
Based on where it’s being derived there is one dimension category,
3.Degenerated dimension
Based on how frequently the data in the dimension can be divided into 2 types,
4.Rapidly Changing Dimension (RCD)
5.Slowly Changing Dimension (SCD) 
What is a fact and what are the types of fact?
A fact is a column or attribute which can be quantifiable or measurable and will be used as key analysis factor. We can call it as a measure.
Types of Fact:
1. Additive
2. Semi-additive
3. Non-additive
What does a fact table contain?
A table which contains facts is called fact table. Typically a fact table has facts and foreign keys of dimension tables.
Fact table structure:
Foriegn_key1
Forign_keyN
Fact1
FactN
What are the types of a fact table?
Transactional
The fact table will contain data’s in very detail level without any rollup/aggregation the way how transactional database stores.
Accumulating
Accumulating refers storing multiple entries for a single record to track the changes throughout the workflow.
Periodic snapshot
The data will be extracted and loaded for a particular period of a time. It describes what would be the state of the record in that specific period.
Factless fact table
When a fact table does not have any fact is called Factless fact table. It has only foreign keys of dimension tables.
Why staging table is required?
  1. To reduce the complexity of Job (It will be more complex when we move directly from Source to Target)
  2. To avoid the source database update.
  3. To perform any calculations.
  4. To perform data cleansing process as per business need.
  5. When the data has been corrupted in Target after the load, we can delete the corrupted data in Target database after that we can just load the unloaded/deleted data alone into Target from staging database.
What is a surrogate key?
In most of the table, the  primary key will be loaded from source schema, but some source table might not have a primary key in such has by using sequence generator the primary key will be created, such keys are called Surrogate key.
In terms of usage, there is no difference between these two types of keys. Both differ in the way of loading primary key loaded from the source table, whereas surrogate key loaded by the sequence generator.
OLTP vs DW database
OLTPDW
Dedicated database available for specific subject area or business applicationIntegrated from different business applications
It does not keep  historyIt keeps history of data for analyzing past performance
It allows user to perform the below DML operations (Select, Insert, Update,Delete)It allows only Select for end users
The main purpose is for using day to day transactionsPurpose is for analysis and reporting
Data volume will be lessData volume is huge
Data stored in normalized formatData stored in de-normalized format

Explain about star schema
Operational Data Store (ODS) 
vs Staging database
ODSStaging
It will have limited period of data (30 to 90 days)Based on type of load it stores incremental data or full volume of data
Operational processingTemporary data storage and for doing data cleansing and other calculations
Integrated from different business linesBased on business need, normally the each business line would have dedicated staging
This type schema contains the fact table in center position. As we know that fact table contains a reference to dimension tables. Then the fact table will be surrounded by dimension tables with foreign key reference. The dimension table will not have a reference with any other dimension.
Explain about snowflake schema
This type also contains a fact table in center position. The fact table has a reference to dimension tables. The dimension table will have a reference to another dimension. The data will be stored in the more normalized form.
What is the difference between star and snowflake?
StarSnowflake
As there is no relationship between dimensions to other dimensions the performance will be high.Due to multiple links between dimensions the performance will be low.
The number of joins will be less which makes query complexity lowThe number of joins will be more which makes query complexity high
Consider the Project dimension mentioned in above example it has Role column where the Role name value will be stored against for each project in case of start schema, the size of the table will be highThe role information is separately stored in a table and the reference will be linked in Project dimension, it reduces the table size
Data will be stored in de-normalized format in dimension tableData will be stored in more normalized format in dimension tables
What is data cleansing?
Data cleansing is a process of removing irrelevant and redundant data, and correcting the incorrect and incomplete data. It is also called as data cleaning or data scrubbing. All organizations are growing drastically with huge competitions, they take business decisions based on their past performance data and future projection
What is data masking?
What does data masking mean? Organizations never want to disclose highly confidential information into all users. All sensitive data will be restricted to access in all environments other than production. The process of masking/hiding/encrypting sensitive data is called data masking.
Why Data mart?
  1. The data warehouse database contains integrated data for all business lines, for example, a banking data warehouse contains data for all saving, credit and loan accounts databases.
  2. The reporting access level will be given to a person who has authority or needs to see the comparison of data for all three types of accounts.
  3. Meanwhile, a loan account branch manager does not require to see the saving and credit card details, he wants to see only the past performance of loan account alone.
  4. In that case for his analysis, we need to apply data level security to protect saving and credit information’s data warehouse.
  5. At the same time, the number of end users across three accounts will access the same data warehouse, it will end up in poor performance.
  6. To avoid these issues, the separate database will be built on top of data warehouse, named as the data mart. The access will be given for respective business line resources not for everyone.
What is data purging and archiving?
Data purging means deleting data from a database which crosses the defined retention time.
Archiving means moving the data which crosses the defined retention time to another database (archival database).
What are the types of SCD?
SCD Type 1
-Modifications will be done on the same record
-Here no history of changes will be maintained
SCD Type 2
-An existing record will be marked as expired with is_active flag or Expired_date column
-This type allows tracking the history of changes
SCD Type 3
-A new value will be tracked as a column
-Here history of changes will be maintained
What type of schema and SCD type used in your project?
In my current project, we are using type2 to keep the history of changes.

Google Q&A Forum