Narasimha's Blog on Microsoft Business Intelligence and TABLEAU: January 2010

Hi,

Fundamentals of Data Modelling

http://www.agiledata.org/essays/dataModeling101.html
Erwin
Mutating table:A mutating table is a table that is currently being modified by an update, delete, or insert statement.
The default value for query time-out is 30 seconds.

Use the %ISOPEN cursor status variable.
Difference between ODS and OLTP
ODS: It is nothing but a collection of tables created in the Data warehouse that maintains only current data.
OLTP maintains the data only for transactions; these are designed for recording daily operations and transactions of a business.

CUBE operator generalizes relational aggregates
Roll-up allows functional aggregations

Make use of the query governor configuration option. The query governor configuration option can be used to prevent system resources from being consumed by long-running queries. By default, the option is set to allow all queries to execute, no matter how long they take. However, you can set the query governor to limit the maximum number of seconds that all queries are allowed to execute for all connections, or just the queries for a specific connection. Because the query governor is based on estimated query cost, rather than actual elapsed time, it does not have any run-time overhead. It also stops long-running queries before they start, rather than running them until some predefined limit is hit

Relational Algebra
It is procedural query language. It consists of a set of operations that take one or two relations as input and produce a new relation.

SQL Server employs three types of join operations:
Nested loops join
Merge joins
Hash joins

Materialized ViewA materialized view is a replica of a target master from a single point in time. The master can be either a master table at a master site or a master materialized view at a materialized view site.
Read-Only, Updatable, and Writeable Materialized Views

Data integrity
Data integrity is an important feature in SQL Server. When used properly, it ensures that data is accurate, correct, and valid. It also acts as a trap for otherwise undetectable bugs within applications.
A PRIMARY KEY constraint is a unique identifier for a row within a database table. Every table should have a primary key constraint to uniquely identify each row and only one primary key constraint can be created for each table. The primary key constraints are used to enforce entity integrity.
A UNIQUE constraint enforces the uniqueness of the values in a set of columns, so no duplicate values are entered. The unique key constraints are used to enforce entity integrity as the primary key constraints.
A FOREIGN KEY constraint prevents any actions that would destroy links between tables with the corresponding data values. A foreign key in one table points to a primary key in another table. Foreign keys prevent actions that would leave rows with foreign key values when there are no primary keys with that value. The foreign key constraints are used to enforce referential integrity.
A CHECK constraint is used to limit the values that can be placed in a column. The check constraints are used to enforce domain integrity.
A NOT NULL constraint enforces that the column will not accept null values. The not null constraints are used to enforce domain integrity, as the check constraints.

De-normalization
De-normalization is the process of attempting to optimize the performance of a database by adding redundant data. It is sometimes necessary because current DBMS implement the relational model poorly. A true relational DBMS would allow for a fully normalized database at the logical level, while providing physical storage of data that is tuned for high performance. De-normalization is a technique to move from higher to lower normal forms of database modeling in order to speed up database access

Difference between view and materialized view
A view takes the output of a query and makes it appear like a virtual table and it can be used in place of tables. A materialized view provides indirect access to table data by storing the results of a query in a separate schema object.

Data model
The goal of the data model is to make sure that the all data objects required by the database are completely and accurately represented.

Data warehousing
A data warehouse is the main repository of an organization's historical data, its corporate memory. It contains the raw material for management's decision support system.

ODS:
ODS is abbreviation of Operational Data Store. A database structure that is a repository for near real-time operational data rather than long term trend data. The ODS may further become the enterprise shared operational database, allowing operational systems that are being reengineered to use the ODS as there operation databases.

Data Mart
A data mart (DM) is a specialized version of a data warehouse (DW). Like data warehouses, data marts contain a snapshot of operational data that helps business people to strategize based on analyses of past trends and experiences. The key difference is that the creation of a data mart is predicated on a specific, predefined need for a certain grouping and configuration of select data. A data mart configuration emphasizes easy access to relevant information (Reference: Wiki). Data Marts are designed to help manager make strategic decisions about their business.

Dimensional data model
Dimensional data model concept involves two types of tables and it is different from the 3rd normal form. This concepts uses Facts table which contains the measurements of the business and Dimension table which contains the context(dimension of calculation) of the measurements.

Fact tableFact table contains measurements of business process. Fact table contains the foreign keys for the dimension tables. Example, if you are business process is "paper production", "average production of paper by one machine" or "weekly production of paper" will be considered as measurement of business process.

Dimensional table
Dimensional table contains textual attributes of measurements stored in the facts tables. Dimensional table is a collection of hierarchies, categories and logic which can be used for user to traverse in hierarchy nodes.

Difference between ER Modeling and Dimensional Modeling
ER modeling is used for normalizing the OLTP database design. Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.

Domain:
A group of predefined column property characteristics that can be saved and attached to columns to speed model development.

Cardinality:
The ratio of parent instances to child instances. In IDEF1X, the cardinality of binary relationships is 1: n, whereby n may be one of the following:
zero, one, or more - signified by a blank space
one or more - signified by the letter P
zero or one - signified by the letter Z
exactly n - where n is some number

Forward engineering
The process of generating the physical database schema from the logical data model.

Reverse engineering:
The process of generating a logical model from a physical database.

Erwin calls the conversion / transformation of LDM => PDM as Forward Engineering which further leads to the actual code generation and the conversion of Code => PDM => LDM as Reverse Engineering!

One-to-many:
A relationship where each parent instance has zero, one, or more child instances.

Many-to-many
A relationship between two entities where instances in one entity have zero, one, or more related instances in the other entity, and vice versa (e.g., a many to-many relationship exists between students and classes. Each student can attend many classes; each class can include many students).

Restrict
A referential integrity option that prohibits the deletion of a parent instance if one or more dependent child instances exist.

Star schema: In the star schema design, a single object (the fact table) sits in the middle and is radically connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.
Snowflake schema: In this a Dimension can have multiple Levels within multiple Hierarchies. From each hierarchy of levels any one Level can be attached to Fact Table. Mostly Lowest Level is attached to Fact Table.
OR
Snowflake Schema, each dimension has a primary dimension table, to which one or more additional dimensions can join. The primary dimension table is the only table that can join to the fact table.

Difference between star flake and snow flake
Star schema - A single fact table with N number of Dimension, all dimensions will be linked directly with a fact table. This schema is denormalized and results in simple join and less complex query as well as faster results. Snow schema - Any dimensions with extended dimensions are know as snowflake schema, dimensions maybe interlinked or may have one to many relationship with other tables. This schema is normalized and results in complex join and very complex query as well as slower results.
Conformed dimensions mean the exact same thing with every possible fact table to which they are joined. They are common to the cubes.

Degenerate dimension table
If a table contains the values, which are neither dimensions nor measures is called degenerate dimensions.

Non-additive facts
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table. However they are not considered as useless. If there is a change in dimensions the same facts can be useful.

Factless facts table
A fact table which does not contain numeric fact columns it is called factless facts table.
What is the purpose of Factless Fact Table?
Fact less tables are so called because they simply contain keys which refer to the dimension tables. Hence, they don’t really have facts or any information but are more commonly used for tracking some information of an event. Eg. To find the number of leaves taken by an employee in a month.

What is a level of Granularity of a fact table?
A fact table is usually designed at a low level of Granularity. This means that we need to find the lowest level of information that can store in a fact table.
E.g. Employee performance is a very high level of granularity.
Employee_performance_daily, employee_perfomance_weekly can be considered lower levels of granularity.

Slowly changing dimensions
SCD is abbreviation of slowly changing dimensions. SCD applies to cases where the attribute for a record varies over time. There are three different types of SCD.1) SCD1: The new record replaces the original record. Only onerecord exists in database - current data.
2) SCD2: A new record is added into the customer dimension table. Two records exist in database - current data and previous history data.
3) SCD3: The original data is modified to include new data. One record exists in database - new information are attached with old information in same row.

Hybrid SCDsHybrid SCDs are combination of both SCD 1 and SCD 2. It may happen that in a table, some columns are important and we need to track changes for them i.e., capture the historical data for them whereas in some columns even if the data changes, we don't care.

Data ModelingFollowing Data Modeling topics are covered

Entity Relationship Modeling
Ÿ Base Data Patterns
Ÿ Design Concepts

Dimension Modeling
Ÿ Dimension Modeling Terminology
Ÿ Dimensional Modeling Constructs

Overview of Data Modeling Tools

Entity Relationship Modeling – Base Data Patterns

Definition: A Data Pattern is a group of a few related entity types that implement a concept and can be applied consistently to model many different business requirements.

Types of Data Pattern
Dependency
Domain
Association
Domain-Based Association
Transaction

Dependency:
Dependency is based upon whether an occurrence of one entity type is dependent or independent of occurrences of another entity type.
Dependency is indicated by the optionality of each relationship.
Benefits:
Ensures an understanding of the business rules
Aids in project planning of database implementation
Helps segment data model for a prototype
Dependency Example:

An analyst at a brokerage firm must monitor one or more companies. A company that the brokerage tracks may or may not be monitored by an analyst. Therefore, Analyst is a dependent entity type, and company is an independent entity type.

Dependency Pattern:
One to Many (1: N) Optional ( B is dependant on A)
Optional One to Mandatory Many (1: N) A is dependent on B
Optional One to Optional Many (1: N) (No dependency)
One to Many (1: N) Mandatory (A and B are dependent on each other)
Domain Pattern:
A Domain pattern is used for an attribute that is mandatory and its set of values is finite
A Domain pattern maintains the specific set of values that are valid for an attribute.
Benefits:
Ensures data and referential integrity.
Facilitates responsiveness to business rule changes
Domain Example:

In the above example, Employee type has a finite set of values. An employee must be classified as a waiter, cashier, manager, dishwasher or chef.

Domain Implementation:
Code tables
Look-up tables
Validation lists
Permitted Values
Drop down list boxes

Association:
The Association pattern is used to resolve many-to-many relationships.
The Association pattern introduces a new entity type which can contain information about a relationship between two other entity types.
Benefits:
An association is necessary to male relational data model work.
An association provides a home for attributes that describe a relationship.
Association Example:
This means that a Stock can be purchased from many Clients and a Client may own many Stocks. Client Stock is an association entity type that holds data regarding a specific Stock owned by a specific Client. The primary key for Client Stock-Part is a compound key comprised of the foreign key "client_nbr" migrated from Client and the foreign key "stock_id" migrated from Stock.
Implementation:
Association relationships generally play two roles in business:
Tracking - records incidents which associate occurrences, such as in our Client/Stock example.
Authorizing - captures approval for future involvement.

Domain-Based Association:
The Domain-Based Association pattern is used when attributes can take on multiple values for a single occurrence of an entity type.
The Domain-Based Association pattern is a special use of an Association pattern where one end of a relationship is a Domain pattern.
Benefits:
Allows attributes to have multiple values for a single occurrence of an entity type.
Stabilizes the data model.
Provides responsiveness to changes in business rules.
Domain-Based Association Example:
Domain - Based Association Pattern Recognition:
The Domain-Based Association pattern is used for attributes that can take on multiple values for a single occurrence of an entity type.
Transaction:
The Transaction pattern assists in discovering and accurately diagramming a header/details data structure. It is typically used for purchase orders, invoices etc.
Transaction Example:
Order and Order Line represent the Transaction pattern, which is almost always related to some form of product or service as shown above.
Transaction Pattern Recognition:

Entity Relationship Modeling – Design Concepts
Normalization Definition
Normalization ensures that attributes are allocated to the correct entity type.
Normalization ensures that entity types contain data about only one thing.
Normalization cannot validate the accuracy of a model in reflecting business requirements.
Normalization Benefits:
Eliminates Data Anomalies (Create, Read, Update, Delete)
Verifies Structural Correctness
Maximizes Model Stability, Effectiveness and Reuse.

Normalization helps to stabilize a data model. If normalization rules are not followed, attributes will be duplicated or placed in the wrong entity type. This forces application to redundantly create, read, update and/or delete data (i.e., a data anomaly). The goal of a data model is to store one fact in one place to eliminate anomalies and provide greater effectiveness and reusability.

Normalization – Normal forms:
First Normal Form
Second Normal Form
Third Normal Form

First Normal Form
Rules include
Identify a primary key
Ensure attributes are in basic form
Remove repeating attributes
Functional Dependence:
Functional dependence is applied toward attributes and is the basis for normalization. An attribute is functionally dependent if its value replies solely on the primary key and would lose its meaning if separated from that primary key.

Second Normal Form:
Rules include:
Entity must already be in first normal form.
Move describing attributes that are not functionally dependent on the entire primary key to their own entity type.

Third Normal Form:
Rules include:
Entity must already be in second normal form.
Move describing attributes that are dependent upon one another to an new entity type.

Complex Data Patterns:
• Supertype – Subtype
• Recursive Hierarchy
• Recursive Network
• Role-Based Association
• Time-Dependent Data

Supertype – Subtype
Represents a grouping of entities that share some common characteristics, but which may also have unique characteristics.
These unique characteristics could be either attributes, relationships or business rules.
A Subtype must be associated with a supertype in a one-to-one optional relationship.

Rules to follow when using a Supertype – Subtype:
Subtypes use same (pk) as supertype
All Subtypes relate to the Supertype in a single one-to-one optional relationship.
A Domain must be used with the Supertype.
Use Subtypes for unique attributes or relationships.

Recursive Pattern:

Recursive Hierarchy
Recursive Network
1:M
M:M

If the relationship is hierarchical, each node can have many children but only one parent. If the relationship is a network each node can have many children and many parents.

Recursive Hierarchy:
A Recursive Hierarchy represents a one-to-many relationship form an entity type to itself. For example, employees that manage other employees, a company profit center that is comprised of other company profit centers, or a person that is a parent to other persons. Hierarchical recursive relationships are very common in data modeling.

Recursive Hierarchy – common uses
• Organizational structure
• Geographical structure
• General ledger accounting structure

Recursive Network:
A Recursive Network represents a many-to-many relationship between occurrences of the same entity type.
Recursive Network – Common uses
• Bill of materials
• Matrix management organization
• Medical procedures

Role – Based Association
A Role-Based Association supports multiple relationships between two entity types.

Time-Dependent Data Pattern:
The Time-Dependent pattern supports data values that change over time or aggregate within periods when it is necessary to maintain historic or future values.

Dimension Modeling:
Dimensional Modeling is a technique used to develop an information model that supports queries of business facts from different perspectives.
A dimensional model is often referred to as a “star schema”.

Ÿ Every dimensional model is composed of one table with multipart key, called the fact table, and a set of smaller tables called dimension tables
Ÿ Each dimension table has a single part primary key that corresponds exactly to one of the components of the multipart key in the fact table
Ÿ This characteristic star-like structure is often called a star join
Ÿ The fact table contains facts or measurements of the business
Ÿ The dimension tables contain textual attributes that describe the facts
Ÿ The attributes in the dimension tables are used for constraining & grouping data within data warehouse queries

Dimension Modeling Development:
Dimension Modeling Development:
Dimension Modeling Terminology:
Fact Table:
Definition:
The performance measures of the business
Usually stores numerical and additive measures.
The “What I want to know”

Characteristics:
Basis for analysis
Continuously valued
Can be derived or calculated
Column headers in query results
Examples:
Revenue
Quantity
Cost
Gross Margin

Dimension Table:
Definition:
Descriptions of the business. The “Which, Who, how, where or when that describes or explains the fact.”

Characteristics:
Constant
Enables “Slicing and dicing” the facts by different variables.
Examples:
Time
Customer
Product
Branch
Dimensional Attributes:
Definition:
Fields within the dimension table.
Describes each item associated with a dimension.
Characteristics:
Acts as a source of query constraints
DW is only as good as the attributes in the dimension table.
Examples:
Month, quarter, year, season, holiday, name, street, address, city, brand, category, description, region, type, manager.

Snowflake Model:
Ø Low cardinality dimensional attributes can be moved to the sub-dimension tables
Ø Controlled normalization of the dimension table leads to lower redundancy
Ø Hybrid data model combining Dimensional model and Third normal form Entity Relational model characteristics

Conformed Dimension:
Shared Dimensions Must Conform
l Option 1:Identical dimensions with the same keys, labels, definitions
and values

ITEM KEY
STORE KEY
DATE KEY
Sales Fact
PROMO KEY
..
Category
Brand Desc.
Item Desc.
Item Key
Sales Schema

Option 2: “Subset” of base dimension

Sales Schema
Item key Item Desc Brand Desc Category Desc
0001 Cheerios 10oz Cheerios Cereal
..
Desc.
Category
Brand desc.
Item Desc.
Item Key
ITEM KEY
STORE KEY
DATE KEY
Sales $
PROMO KEY
a Desc
Week Desc
Day-of-week
DATE KEY

Slowly Changing Dimensions
Dimensional attributes evolve over time
Ÿ For example, customers change their names, move, have children, adjust their Incomes

For every dimension attribute, need to identify “Changes” strategy
Ÿ May use combination of strategies within same dimension table
Type 1: Overwrite the changed attributes
Type 2: Add a New Dimension Record and update the end date of the current active Dimension record with the start date of the new dimension record
Type 3: Add a “Prior” Attribute

Data Warehouse Keys
Ÿ All tables (facts and dimensions) should use Data Warehouse generated surrogate keys
w Isolate warehouse keys from operational changes
w Improve performance
w Support Integration from multiple sources
w Enable tracking dimension changes

Ÿ Recommended
w Integer, sequence number
w Don’t use operational IDs or meaningful keys

Additive Measures:
Additivity is the ability of measures to be added across all dimensions of the fact table. Measures could be fully additive, semi additive or non additive
Ÿ Fully Additive: Perfectly additive across all dimensions
Example: Sales Quantity, Revenue
Ÿ Semi Additive: Not additive across one or more dimensions (usually across time). Need to query using single unit of non-additive dimension
Example: Account Balance, Inventory, number of customers (Measure of Intensity, head counts)
Ÿ Non-Additive: Non-additive. Store individual components and calculate ratio of sums vs. sum of ratio’s
Example: Profit margin (Ratios and Percentages) i.e., Ratio or Percentages should not be added 1:3, 30%, etc.
Overview of Data Modeling Tools:
A Data Modeling tool
Ÿ Improves productivity among developers when database designs are divided, shared, and reused
Ÿ Establishes corporate modeling standards
Ÿ Creates good documentation (metadata) in variety of useful formats
Ÿ Ensures consistency, reuse, and integration of enterprise data
Ÿ Enable creation the data model in one notation and conversion it to another notation without losing the meaning of the model
Ÿ Saves time by accelerating the creation of high-quality, high-performance physical database from logical model
Ÿ Conserves resources and improves accuracy by synchronizing model and database
Ÿ Requirements for a good data modeling tool
Ÿ Diagram Notation
Ÿ Both ER & dimensional modeling notation must be available
Ÿ Reverse Engineering
Ÿ Creation of a model based on the source data in the operational environment as well as from other external sources of data
Ÿ Forward Engineering
Ÿ Creation of the data definition language (DDL) for the target tables in the data warehouse databases
Ÿ Source to Target Mapping
Ÿ Linking of source data in the operational systems and external sources to the data in the databases in the target data warehouse
Ÿ Data Dictionary or Repository
Ÿ Contains metadata that describes the data model

Monday 4 January 2010

Data Modelling