What is a Data warehouse
(DW)?
A data
warehouse is a subject-oriented,
integrated, nonvolatile, time-variant collection
of data in support of management's decisions.
-
WH Inmon
A data warehouse
is a relational database that is designed for query and analysis rather than
for transaction processing. It usually contains historical data derived from
transaction data, but it can include data from other sources. It separates
analysis workload from transaction workload and enables an organization to
consolidate data from several sources.
A data warehouse is a repository of an organization's
electronically stored data. Data warehouses are designed to facilitate
reporting and analysis
This definition of the data warehouse focuses on data storage. Data warehouse
arises in an organization’s need for reliable, consolidated, unique and
integrated reporting and analysis of its data, at different levels of
aggregation.
The practical reality of most organizations is that their data
infrastructure is made up by a collection of heterogeneous systems. For
example, an organization might have one system that handles
customer-relationship, a system that handles employees, systems that handles
sales data or production data, yet another system for finance and budgeting
data etc. In practice, these systems are often poorly or not at all integrated
and simple questions like: "How much time did sales person A spend on
customer C, how much did we sell to Customer C, was customer C happy with the
provided service, Did Customer C pay his bills" can be very hard to
answer, even though the information is available "somewhere" in the
different data systems.
Characteristics of Data Warehouse:
A common way of
introducing data warehousing is to refer to the characteristics of a data
warehouse as set forth by William Inmon:
Ø
Subject
Oriented
Ø
Integrated
Ø
Nonvolatile
Ø
Time
Variant
Subject Oriented:
Data warehouses are designed to help you analyze data. For example, to
learn more about your company's sales data, you can build a warehouse that
concentrates on sales. Using this warehouse, you can answer questions like
"Who was our best customer for this item last year?" This ability to
define a data warehouse by subject matter, sales in this case makes the data
warehouse subject oriented.
Integrated:
Integration is closely related to subject orientation. Data warehouses
must put data from disparate sources into a consistent format. They must resolve
such problems as naming conflicts and inconsistencies among units of measure.
When they achieve this, they are said to be integrated.
Nonvolatile:
Nonvolatile means that, once entered into the warehouse, data should
not change. This is logical because the purpose of a warehouse is to enable you
to analyze what has occurred.
Time Variant:
In order to discover trends in business, analysts need large amounts of
data. This is very much in contrast to online transaction processing (OLTP)
systems, where performance requirements demand that historical data be moved to
an archive. A data warehouse's focus on change over time is what is meant by
the term time variant.
OLTP vs. OLAP:
|
OLTP Or Operational Systems
|
OLAP Or Data Warehouse
|
Full Form
|
Online
Transactional Processing
|
Online
Analytical Processing
|
Purpose
|
OLTP
systems are used to carry the day-to-day business operations.
|
OLAP
is used for the analysis of the business.
|
Source of Data
|
OLTP’s
are the original source of data
|
OLAP
will get the data from the various OLTP Databases.
|
Type of Data
|
It
will have the detailed data
|
It
will have detailed and summarized data
|
End Users
|
Customers
|
Managers,
Departmental Heads & Board of Directors.
|
No of End users
|
Millions
|
Hundreds
to Thousands
|
Typical operations
|
Select,
Inserts, Updates & Delete
|
Read
Only ( i.e. Select)
|
Processing Speed
|
Very
fast (Response time will be in ms)
|
Depending
on the amount of data involved. Usually it will be slow.
|
Access Frequency
|
Large
number of customers will access the OLTP applications for every second.
|
Rarely
the OLAP databases are accessed by the managers. Example once in a day or
monthly.
|
Volatile
|
In
OLTP applications the data will be changing very fast (i.e. Volatile data)
|
In
OLAP once the data is entered it is not going to change (i.e. Non Volatile
Data)
|
Historical Data
|
OLTP’s
will hold maximum of 30 to 90 Days of data.
|
OLAP
will hold 3 to 5 Years of data.
|
Space Required
|
Need
relatively less amount of space.
|
Need
Large amount of space required since data warehouses will store historical
data
|
DB Design
|
Database
will be in Normalized form.
|
Database
will be in Denormalized form.
|
DB Schema
|
E-R
Model
|
Dimensional
Model
|
Redundant Data
|
Less
duplicate data
|
More
Duplicate data
|
Indexes
|
Few
|
Many
|
Data Warehousing Objects:
Fact tables and dimension tables are
the two types of objects commonly used in dimensional data warehouse schemas.
FACT
TABLE:
Fact tables are the large tables in
your warehouse schema that store business measurements. Fact tables typically
contain facts and foreign keys to the dimension tables. Fact tables represent
data, usually numeric and additive, that can be analyzed and examined. Examples
include sales, cost, and profit.
Fact
table consists of the measurements, metrics or facts of a business
process. It is often located at the centre of a star schema
or a snowflake schema, surrounded by dimension
tables.
DIMENSION TABLE:
Dimension tables, also known as lookup
or reference tables, contain the relatively static data in the warehouse.
Dimension tables store the information you normally use to contain queries.
Dimension tables are usually textual and descriptive and you can use them as
the row headers of the result set. Examples are customers or products.
The primary
function of dimensions is threefold: to provide filtering, grouping and
labeling. For example, in a data warehouse where each person is categorized as
having a gender of male, female or unknown, a user of the data warehouse would
then be able to filter or categorize each presentation or report by either
filtering based on the gender dimension or displaying results broken out by the
gender.
Each
dimension in a data warehouse may have one or more hierarchies applied to it.
For the "Date" dimension, there are several possible hierarchies:
"Day > Month > Year", "Day > Week > Year",
"Day > Month > Quarter > Year", etc.
Hierarchy:
A
logical structure that uses ordered levels as a means of organizing data. A
hierarchy can be used to define data aggregation; for example, in a time
dimension, a hierarchy might be used to aggregate data from the Month level to
the Quarter level, from the Quarter level to the Year level. A hierarchy can
also be used to define a navigational drill path, regardless of whether the
levels in the hierarchy represent aggregated totals or not.
Level:
A
position in a hierarchy. For example, a time dimension might have a hierarchy
that represents data at the Month, Quarter, and Year levels.
TYPES
OF FACTS
Additive
- Measures that can be added across all dimensions.
Non
Additive - Measures that cannot be added across all
dimensions.
Semi
Additive - Measures that can be added across few
dimensions and not with others.
Let
us use examples to illustrate each of the three types of facts.
The first example assumes that we are a retailer, and we have a fact table with
the following columns:
Date
|
Store
|
Product
|
Sales_Amount
|
The purpose of this table is to record
the sales amount for each product in each store on a daily basis. Sales_Amount
is the fact. In this case, Sales_Amount is an additive fact, because you can
sum up this fact along any of the three dimensions present in the fact table --
date, store, and product. For example, the sum of Sales_Amount for all 7 days
in a week represent the total sales amount for that week.
Say we are a bank with the following
fact table:
Date
|
Account
|
Current_Balance
|
Profit_Margin
|
The purpose of this table is to record
the current balance for each account at the end of each day, as well as the
profit margin for each account for each day. Current_Balance and Profit_Margin
are the facts. Current_Balance is a semi-additive fact, as it makes sense to
add them up for all accounts (what's
the total current balance for all accounts in the bank?), but it does not make
sense to add them up through time (adding up all current balances for a given
account for each day of the month does not give us any useful information).
Profit_Margin is a non-additive fact, for it does not make sense to add them up
for the account level or the day level.
FACTLESS
FACT TABLE:
A
factless fact table is a fact table that does not contain any facts (i.e.,
metrics). Factless fact tables are used to track events that did, or did not,
happen.
Use
a factless fact table to track events of interest to the organization. For
example, attendance at a cultural event can be tracked by creating a fact table
containing the following foreign keys (i.e., links to dimension tables): event
identifier, speaker/entertainer identifier, participant identifier, event type,
date. This table can then be queried to find out information, such as which
cultural events or event types are the most popular.
Factless
fact tables can also be used to track events that did not happen. Using the
example above we can also identify the events that were not attended or the
event types that were the least popular.
Factless fact table
captures the many-to-many relationships between dimensions, but contains no
numeric or textual facts. They are often used to record events
Coverage information. Common examples
of factless fact tables include:
- Identifying product promotion events
(to determine promoted products that didn’t sell)
- Tracking student attendance or
registration events
- Tracking insurance-related accident
events
- Identifying building, facility, and
equipment schedules for a hospital or university.
TYPES
OF DIMENSIONS
Conformed
Dimension:
Dimensions are conformed when they are
either exactly the same (including keys) or one is a perfect subset of the
other. Most important, the row headers produced in the answer sets from two
different conformed dimensions must be able to match perfectly.
Conformed dimensions are either
identical or strict mathematical subsets of the most granular, detailed
dimension. Dimension tables are not conformed if the attributes are labeled
differently or contain different values. Conformed dimensions come in several
Different flavors. At the most basic
level, conformed dimensions mean the exact same thing
With every possible fact table to
which they are joined. The date dimension table connected to the sales facts is
identical to the date dimension connected to the inventory facts.
Junk
Dimension:
A junk dimension is a convenient
grouping of typically low-cardinality flags and indicators. By creating an
abstract dimension, these flags and indicators are removed from the fact table
while placing them into a useful dimensional framework.
Junk dimensions are
dimensions that contain miscellaneous data (like flags and indicators) that do
not fit in the base dimension table.
When developing a
dimensional model, we often encounter miscellaneous indicators and flags that don't
logically belong to the core dimension tables. These unattached attributes are
usually too valuable to ignore or exclude. Designers sometimes want to treat
them as facts (supposed textual facts) or clutter the design with numerous
small dimensional tables. A third, less obvious but preferable, solution is to
incorporate a junk dimension as a holding place for these flags and indicators.
An interesting use for a junk
dimension is to capture the context of a specific transaction. While our common,
conformed dimensions contain the key dimensional attributes of interest, there
are likely attributes about the transaction that are not known until the
transaction is processed.
For example, a
healthcare insurance provide may need to capture the context surrounding their claims
transactions. The grain for this key business process is one row for each line
item on a claim. Due to the complexities of the healthcare industry, similar
claims may be handled quite differently. They may design separate junk
dimensions to capture the context of how the claim was processed, how it was
paid, and the contractual relationship between the healthcare providers at the
time of the claim.
There are two
approaches for creating junk dimensions. The first is to create the junk
dimension table in advance. Each possible, unique combination generates a row
in the junk dimension table. The second approach is to create the rows in the
junk dimension on the fly during the extract, transformation, and load (ETL)
process. As new unique combinations are encountered, a new row with its
surrogate key is created and loaded into the junk dimension table.
If the total number
of possible rows in the junk dimension is relatively small, it is probably best
to create the rows in advance. On the other hand, if the total number of
possible rows in the junk dimension is large, it may be more advantageous to
create the junk dimension as unique rows are encountered. One of the junk
dimensions encountered in
The recent healthcare design had over
1 trillion theoretical rows, while the actual number of observed rows was tens
of thousands.
IT Online Training @ Monstercourses
Obviously, it did
not make sense to create all the theoretically possible rows in advance. If the
number of rows in the junk dimension approaches or exceeds the number of rows
in the fact table, the design should be clearly re-evaluated.
Since a junk dimension includes all
valid combinations of attributes, it will automatically track any changes in
dimension attributes. Therefore slowly changing dimension strategies do not
need to be considered for junk dimensions.
Degenerated
Dimension:
A dimension key, such as a transaction
number, invoice number, ticket number, or bill-of-lading number, that has no
attributes and hence does not join to an actual dimension table. Degenerate
dimensions are very common when the grain of a fact table represents a single
transaction item or line item because the degenerate dimension represents the
unique identifier of the parent. Degenerate dimensions often play an integral
role in the fact table's primary key.
Role-Playing
Dimensions:
Dimensions are often recycled for
multiple applications within the same database. For instance, a
"Date" dimension can be used for "Date of Sale", as well as
"Date of Delivery", or "Date of Hire". This is often referred
to as a "role-playing dimension".
Data
Mart (DM):
A Data Mart is the subset of the Datawarehouse
Monstercourses (DW). Usually data marts are oriented or focused towards one Line
of Business (LOB) or one Department or one Subject Area (Examples such as
Sales, Marketing or Finance) where as Data Warehouse deals with multiple
subject areas.
You can have two types of Data Marts.
Ø Dependent
Data Marts
Ø Independent
Data Marts
Dependent
Data Mart: If the data marts are designed using conformed dimensions, then they will be
related, Such Data Marts are called as Dependent Data Marts. Dependent data marts are usually built to
achieve improved performance and availability, better control, and lower
telecommunication costs resulting from local access of data relevant to a
specific department.
Independent
Data Mart: A data mart that is not dependent on
the other data marts is called the
independent data mart. The
creation of independent data marts is often driven by the need to have a
solution within a shorter time.
Bottom-up
design:
Ralph
Kimball, a well-known author on data warehousing, is a proponent of an approach
to data warehouse design frequently considered as bottom-up. In the so-called
bottom-up approach data marts are first created to provide reporting and
analytical capabilities for specific business processes. Data marts contain
atomic data and, if necessary, summarized data. These data marts can eventually
be unioned together to create a comprehensive data warehouse. The combination
of data marts is managed through the implementation of what Kimball calls
"a Datawarehouse Monstercourses bus architecture"
Business
value can be returned as quickly as the first data marts can be created.
Maintaining tight management over the data warehouse bus architecture is
fundamental to maintaining the integrity of the data warehouse. The most
important management task is making sure dimensions among data marts are
consistent. In Kimball's words, this means that the dimensions
"conform".
Top-down
design:
Bill Inmon, one of the first authors
on the subject of data warehousing, has defined a data warehouse as a
centralized repository for the entire enterprise. Inmon is one of the leading
proponents of the top-down approach to data warehouse design, in which the
Data warehouse is designed using a
normalized enterprise data model. "Atomic" data, that is, data at the
lowest level of detail, are stored in the data warehouse. Dimensional data
marts containing data needed for specific business processes or specific
departments are created from the data warehouse. In the Inmon vision the data
warehouse is at the center of the "Corporate Information Factory"
(CIF), which provides a logical framework for delivering business intelligence
(BI) and business management capabilities.
Data Warehouse Vs Data Mart:
|
Data Warehouse
|
Data Mart
|
Subjects
|
Multiple subject areas data.
|
Single partial or multiple partial subject areas data.
|
Data Perspective
|
Will have the Historical Detailed Data.
|
Will have the Detailed data with some historical data.
|
Implement time Frame
|
9–18 Months for the first stage.
Multiple stage implementations.
|
4-12 Months
|
ROI(Return of Investment)
|
Will take more time to get the ROI
|
Will take less time to get the ROI.
|
Implementation Cost
|
Expensive
|
Relatively Cheap.
|
Change Management
|
It is very difficult to implement the change.
|
Easy to implement the Change.
|
Corporate Support
|
It is very difficult to get the continuous support from all the
departments.
|
Can lead to independent data marts.
|
Database Design
|
Technical challenges in building the large databases.
|
We can easily build the databases.
|
Star Schema:
Star Schema is a relational database schema
for representing multi-dimensional data. It is the simplest form of data
warehouse schema that contains one or more dimensions and fact tables. It is
called a star schema because the entity-relationship diagram between dimensions
and fact tables resembles a star where one fact table is connected to multiple
Dimensions. The center of the star schema
consists of a large fact table and it points towards the dimension tables. The
advantages of star schema are slicing down, performance increase and easy
understanding of data.
Snowflake Schema:
A snowflake schema is a term that describes
a star schema structure normalized through the use of outrigger tables. I.e. dimension
table hierarchies are broken into simpler tables.
In star schema example we had 4 dimensions
like location, product, time, organization and a fact table (sales).
Snowflake schema, the example diagram shown
above has 4 dimension tables, 4 lookup tables and 1 fact table. The reason is
that hierarchies (category, branch, state, and month) are being broken out of
the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively
and shown separately. In OLAP, this Snowflake schema approach increases the
number of joins and poor performance in retrieval of data. In few
organizations, they try to normalize the dimension tables to save space. Since
dimension tables hold less space, Snowflake schema approach may be avoided.
Surrogate Key:
A surrogate
key is a substitution for the natural primary key.
It
is just a unique identifier or number for each row that can be used for the
primary key to the table. The only requirement for a surrogate primary key is
that it is unique for each row in the table.
Data warehouses typically use a surrogate, (also known as artificial
or identity key or Meaningless Key), key for the dimension tables primary keys.
They can use Info sequence generator, or Oracle sequence, or SQL Server
Identity values for the surrogate key. It is useful because the natural primary
key (i.e. Customer Number in Customer table) can change and this makes updates
more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are
stated as the primary keys (according to the business users) but, not only can
these change,
Indexing
on a numerical value is probably better and you could consider creating a
Surrogate
key called, say, AIRPORT_ID. This would be internal to the system and as far as
The
client is concerned you may display only the AIRPORT_NAME.
Another
benefit you can get from surrogate keys (SID) is: Tracking the SCD - Slowly
Changing Dimension.
Slowly Changing Dimensions:
Dimensions
that change over time are called Slowly Changing Dimensions. For instance, a product
price changes over time; People change their names for some reason; Country and
State names may change over time. These are a few examples of Slowly Changing
Dimensions since some changes are happening to them over a period of time.
Example:
Christina
is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the
original entry in the customer lookup table has the following record:
Customer Key
|
Name
|
State
|
1001
|
Christina
|
Illinois
|
At a later date, she moved to Los Angeles, California on January,
2003. How should ABC Inc. now modify its customer table to reflect this change?
This is the "Slowly Changing Dimension" issue.
There
are in general three ways to solve this type of problem, and they are
categorized as follows:
Type 1: The new record replaces the original
record. No trace of the old record exists.
In Type 1 Slowly Changing Dimension, the new information simply
overwrites the original information. In other words, no history is kept.
In
our example, recall we originally have the following table:
Customer Key
|
Name
|
State
|
1001
|
Christina
|
Illinois
|
After
Christina moved from Illinois to California, the new information replaces the
new record, and we have the following table:
Customer Key
|
Name
|
State
|
1001
|
Christina
|
California
|
Advantages:
-
This is the easiest way to handle the Slowly Changing Dimension problem, since
there is no need to keep track of the old information.
Disadvantages:
-
All history is lost. By applying this methodology, it is not possible to trace
back in history. For example, in this case, the company would not be able to
know that Christina lived in Illinois before.
When
to use Type 1:
Type 1 slowly changing dimension should be used when it is not
necessary for the data warehouse to keep track of historical changes.
Type 2: A new record is added into the customer
dimension table. Therefore, the customer is treated essentially as two people.
In Type 2 Slowly Changing Dimension, a new record is added to the
table to represent the new information. Therefore, both the original and the
new record will be present. The new record gets its own primary key.
In
our example, recall we originally have the following table:
Customer Key
|
Name
|
State
|
1001
|
Christina
|
Illinois
|
After
Christina moved from Illinois to California, we add the new information as a
new row into the table:
Customer Key
|
Name
|
State
|
Eff_Start_Date
|
End_Date
|
1001
|
Christina
|
Illinois
|
01-June-2011
|
05_June_2011
|
1005
|
Christina
|
California
|
05-June-2011
|
null
|
Advantages:
-
This allows us to accurately keep all historical information.
Disadvantages:
-
This will cause the size of the table to grow fast. In cases where the number
of rows for the table is very high to start with, storage and performance can
become a concern.
-
This necessarily complicates the ETL process.
When to use Type 2:
Type
2 slowly changing dimension should be used when it is necessary for the data
warehouse to track historical changes.
Type 3: The original record is modified to
reflect the change.
In Type 3 Slowly Changing Dimension, there will be two columns to
indicate the particular attribute of interest, one indicating the original
value, and one indicating the current value. There will also be a column that
indicates when the current value becomes active.
In
our example, recall we originally have the following table:
Customer Key
|
Name
|
State
|
1001
|
Christina
|
Illinois
|
To
accommodate Type 3 Slowly Changing Dimension, we will now have the following
columns:
- Customer Key
- Name
- Original State
- Current State
- Effective Date
After
Christina moved from Illinois to California, the original information gets
updated, and we have the following table (assuming the effective date of change
is January 15, 2003):
Customer Key
|
Name
|
Original State
|
Current State
|
Effective Date
|
1001
|
Christina
|
Illinois
|
California
|
15-JAN-2003
|
Advantages:
-
This does not increase the size of the table, since new information is updated.
-
This allows us to keep some part of history.
Disadvantages:
-
Type 3 will not be able to keep all history where an attribute is changed more
than once. For example, if Christina later moves to Texas on December 15, 2003,
the California information will be lost.
When
to use Type 3:
Type
III slowly changing dimension should only be used when it is necessary for the
data warehouse to track historical changes, and when such changes will only
occur for a finite number of time.
Grain Level: Is the level at which data as to be captured
in the fact table and the dimension table. Grain level should be as low as
possible.
Data warehouses and their architectures vary
depending upon the specifics organization's situation.
Three common architectures are:
Ø Data Warehouse Architecture (Basic)
Ø Data Warehouse Architecture (with a Staging
Area)
Ø Data Warehouse Architecture (with a Staging
Area and Data Marts)
Data Warehouse
Architecture (Basic)
Figure below shows a simple architecture for
a data warehouse. End users directly
Access data derived from several source
systems through the data warehouse.
In Figure, the metadata and raw data of a
traditional OLTP system is present, as is an additional type of data, summary
data. Summaries are very valuable in data Warehouses because they pre-compute
long operations in advance. For example, a typical data warehouse query is to retrieve
something like August sales. A summary in Oracle is called a materialized view.
Data Warehouse Architecture (with a Staging
Area)
In
Figure below, you need to clean and process your operational data before
putting it into the warehouse. You can do this programmatically, although most
data warehouses use a
Staging
area instead. A staging area simplifies building summaries and general
warehouse management. Figure illustrates this typical architecture.
Data Warehouse Architecture (with a Staging Area and Data Marts)
Although the architecture in Figure above is
quite common, you may want to customize your warehouse’s architecture for
different groups within your organization.
You can do this by adding data marts, which
are systems designed for a particular line of business. Figure below
illustrates an example where purchasing, sales, and inventories are separated.
In this example, a financial analyst might want to analyze historical data for
purchases and sales.
There are three types of data warehouses:
- Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.
- ODS (Operational Data Store) - This has a broad enterprise wide scope, but unlike the real entertprise data warehouse, data is refreshed in near real time and used for routine business activity. An ODS may contain 30 to 90 days of information.
- Data Mart - Datamart is a subset of data warehouse and it supports a particular region, business unit or business function.
IF ANY Video Tutorials or IT Online Training NEEDED
PLEASE MAIL TO info@monstercourses.com,
http://www.monstercourses.com/