AKA MONITOR
www.akamonitor.cz

   
 

REVIEW

 
 
 
 
 
 
 
 
 

DATA PROFILING

an online training course by eLearningCurve
(the course is taught in English)

This text follows up on the presentation published at:
www.akamonitor.cz/slideshow/arkady/

Course author

Arkady Maydanchik
- Data quality practitioner with 17+ years of experience
- Founder of Arkidata Corporation
- Since 2004 focused on education
- Created comprehensive program
- Data Quality Assessment for Practitioners
- Frequent speaker at various conferences and contributor to many magazines
- Author of Data Quality Assessment book
- Co-Founder of eLearningCurve

Advantages of the online eLearning course

Advantage of eLearning environment
A great advantage of our eLearning environment is the ability to pause, rewind, and repeat the same material several times. This allows you to fully comprehend the information. You have 30 days to complete your review of the course.

Key characteristics of Data Profiling
The key characteristics of data profiling are:
- Analyze actual data
- Use computer programs
- Produce metadata
- Distinct from data quality assessment

Misconceptions about Data Profiling
Data Profiling is often erroneously equated to gathering aggregate characteristics, frequency charts and distribution charts of individual attributes.
It is a common misconception that Data Profiling is what profiling tools do. In reality, it is not what the tools do, but what you do with the tools and with the information the tools produce.
Data Profiling is the process of analyzing the data and identifying the right questions to ask. It is an analytical, rather than a data processing, discipline!

Data Profiling is actually a collection of many techniques:
- Column Profiling
- Profiling Time-Dependent Data
- Profiling State-Transition Models
- Profiling Relational Data Models
- Attribute Dependency Profiling
- Subject Profiling
- Dynamic Data Profiling

Who the course is for
This course is designed for data management practitioners:
- Involved in data quality management
- Involved in design and maintenance of DW and BI applications
- Involved in design and maintenance of MDM hubs
- Involved in design and maintenance of data interfaces
- Involved in operational data migration and consolidation projects

Course prerequisites
Prerequisites:
- An interest and desire to learn about data profiling
- No technical skills required

Data Profiling as a profession

Data Profiling Profession
- Data profiling is a complex discipline with an extensive body of knowledge
- This course provides theoretical understanding and many examples, but there is no replacement for experience
It is advisable to have recognized data profiling professionals in any organization.

What you will learn in the course
In this course you will learn:
- Various profiling techniques, from simple column profiling to advanced profiling methods for time-dependent and state-dependent data
- What information to gather for different data types
- How to gather data profiles with or without commercial tools
- How to analyze data profiles and ask the right questions about your data
- How to perform dynamic data profiling and identify changes in data structure and meaning

Course structure
Course Structure
Module 1.   Introduction to Data Profiling
Module 2.   Column Profiling
Module 3.   Profiling Time-Dependent Data
Module 4.   Profiling State-Transition Models
Module 5.   Other Profiling Techniques

Module summaries

Summary I
- Data profiling is the process of analyzing actual data and understanding its true meaning and structure
- While there are many data profiling tools, data profiling is not what the tools do, but what you do with them
- Data profiling is a collection of many techniques
- Data profiling produces large volumes of metadata; without a well planned repository, profiling inevitably underachieves or fails
- Data profiling plays a key role in most data management initiatives
- To be successful, data profiling requires IT expertise, business expertise, and specific experience with profiling techniques


Summary II
- Column profile contains metadata about values in an individual data column
- Column profiling is the most common profiling technique because column profiles are relatively easy to produce, yet offer a wealth of metadata
- The objective of column profiling is to learn where each attribute can actually be found, what its domain is, and (to some extent) what its quality is
- Column profiles consist of basic counts, value frequency charts, basic distribution characteristics, and value distribution charts
- Column profiling can be performed without specialized tools; however, profiling tools will significantly speed up the process and offer advanced functionality that is hard to replicate
- The real challenge is not to gather column profiles but to efficiently analyze them and get valuable information out of a potentially huge volume of produced metadata
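
As a rough illustration of the point that column profiles can be gathered without specialized tools, here is a minimal Python sketch. The function name `profile_column` and the sample status codes are assumptions made for this example; they do not come from the course.

```python
from collections import Counter

def profile_column(values):
    """Return a basic column profile for a list of raw values (None = missing)."""
    present = [v for v in values if v is not None]
    freq = Counter(present)
    return {
        "row_count": len(values),                    # basic counts
        "null_count": len(values) - len(present),
        "distinct_count": len(freq),
        "min": min(present) if present else None,    # basic distribution characteristics
        "max": max(present) if present else None,
        "top_values": freq.most_common(3),           # value frequency chart, top 3
    }

# Hypothetical sample column of status codes.
status = ["A", "A", "T", None, "A", "R", "T", None]
profile = profile_column(status)
```

Even this small profile raises the kind of question the course stresses: are the two missing values legitimate, or a data quality problem?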


Summary III
- Understanding the structure and quality of time-dependent data is key to most data-driven initiatives
- The objective of time-dependent data profiling is to learn how much history exists for different data categories, whether it follows any predictable patterns, and whether the meaning of the data changes over time
- Timeline and timestamp profiling help understand availability of historical data for various entities, events, and codes; it also helps discover changes in meaning of the data over time and find patterns in the data
- Advanced profiling for time-dependent data can be multi-dimensional
- Though no tools exist with explicit profiling functionality for time-dependent data, most needed results can be obtained by "creative" use of column profiling or through simple SQL queries
- Profiling time-dependent data requires heavy involvement of business users who understand underlying business processes


Summary IV
- The objects that go through a sequence of states in the course of their life cycle as a result of various events are called state-dependent objects
- State-transition models use simple concepts of states and actions to describe constraints on the life cycle of state-dependent objects
- State-transition model profiling examines the life cycle of actual state-dependent objects as it appears in the data, and provides information about the order, duration, and conditions for state transitions
- Comparing actual data with state-transition models yields a wealth of information about the data quality; profiling also allows reverse-engineering of state-transition models
- No profiling tools today recognize the structure of the data for state-dependent objects; profiling typically is done through a combination of pre-processing, reasonably simple SQL statements and procedures, and column profiling tools


Summary V
- Subject profiling examines subjects across databases, as well as distribution of subject data within each database, in order to understand where the information about each subject can be found
- Relational integrity profiling provides information about actual keys and relationships in relational databases
- Dependency profiling looks for hidden relationships between attribute values
- Dynamic profiling is a process of repeating profiling exercise on a regular basis and comparing the results for the purpose of identifying changes in the data structure and meaning

-------------------------------------------------------
Most Important Thoughts in Data Profiling
----------------------------------------------

Dynamic Data Profiling
A one-time profiling exercise tells the past and current data structure and meaning. However, the structure and meaning of data will change over time.
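
One possible way to picture dynamic profiling is to keep the profile from each run and flag metrics that moved noticeably between runs. The sketch below assumes profiles are stored as plain dictionaries of numeric metrics. The function `compare_profiles`, the 10% tolerance, and the sample figures are illustrative assumptions, not part of the course material.

```python
def compare_profiles(old, new, tolerance=0.10):
    """List metrics whose relative change between two runs exceeds tolerance."""
    changes = []
    for metric in sorted(old.keys() & new.keys()):
        before, after = old[metric], new[metric]
        if before == after:
            continue
        base = abs(before) if before else 1  # avoid dividing by zero
        if abs(after - before) / base > tolerance:
            changes.append((metric, before, after))
    return changes

# Hypothetical column profiles from two monthly runs.
january = {"row_count": 1000, "null_count": 20, "distinct_count": 5}
february = {"row_count": 1040, "null_count": 90, "distinct_count": 6}
flagged = compare_profiles(january, february)
```

Here the growth in row count stays within tolerance, while the jump in nulls and the appearance of a new distinct value are flagged for a business analyst to interpret.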

How to Profile?
- There are many profiling tools, ranging widely in functionality and price
- A significant part of profiling requires using SQL queries and procedures of varying complexity
- Some profiling techniques can be executed within tools with some creativity

How to Organize the Results?
- Profiling produces great volumes of metadata
- These metadata are not only large in volume, but have varying and often rather complex structure
- Without a well planned repository, profiling inevitably underachieves or fails

How to Analyze the Results?
- There is so much information that you must know what to look for and what each profiling element can commonly be used for
- It is critical to have an efficient analytical environment
- It is important to automate the process of looking through profiling results in search for valuable information
- Metadata repository must efficiently track all observations, questions, and answers

Profiling and Data Quality Management
Profiling is key to discovery of data quality rules, and thus it is the first step in a data quality assessment initiative. Profiling is also critical to ongoing data quality monitoring.

Profiling and Master Data Management
Profiling is key to building MDM solutions. Profiling is also critical to ongoing management of MDM hubs.

Role of IT
- Data profiling requires IT expertise
- Data may need to be moved to a staging area and reorganized
- Tool selection process is largely done within IT
- Many profiling techniques require expertise with SQL
- Organizing results is an IT challenge
- Automating result analysis requires somewhat sophisticated algorithms
- Dynamic profiling is an IT challenge
You cannot succeed with data profiling without strong participation from the IT department.

Role of Business
- Data profiling requires business expertise and understanding of data
- Profiling is about asking questions; only business analysts can answer these questions
- Only business analysts can notice certain anomalies in data profiles
- Only business analysts can prioritize data profiling

What are Time-Dependent Data?

Most real world objects change over time, and historical data comprise the majority of data in both operational systems and data warehouses.

To understand the true meaning, structure, and quality of the data, it is critical to know the answers to questions like:
- How much history is there for each code?
- Do the timestamps follow certain patterns?
- Do the timestamp patterns change over time?
- Do the codes change meaning over time?
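
Questions of this kind can often be answered with simple SQL. The sketch below uses Python's built-in `sqlite3` module with an in-memory table. The table `pay_history`, its columns, and the sample rows are hypothetical and serve only to show the shape of the queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pay_history (emp_id INTEGER, pay_code TEXT, eff_date TEXT)")
conn.executemany(
    "INSERT INTO pay_history VALUES (?, ?, ?)",
    [
        (1, "REG", "2005-01-15"),
        (1, "REG", "2006-01-15"),
        (2, "REG", "2007-01-15"),
        (1, "BON", "2007-12-20"),
        (2, "BON", "2008-12-20"),
    ],
)

# Timeline profile: first date, last date, and record count per code.
timeline = conn.execute(
    """
    SELECT pay_code, MIN(eff_date), MAX(eff_date), COUNT(*)
    FROM pay_history
    GROUP BY pay_code
    ORDER BY pay_code
    """
).fetchall()

# Timestamp pattern profile: how records are spread over calendar months.
by_month = conn.execute(
    """
    SELECT strftime('%m', eff_date) AS month, COUNT(*)
    FROM pay_history
    GROUP BY month
    ORDER BY month
    """
).fetchall()
```

In this made-up data the bonus code only appears from late 2007 onward and every record falls in January or December, which is the sort of pattern a business user would be asked to confirm or explain.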

State-Transition Model Features
- Selection of states is driven by the business use of the model
- The same action can apply to many states, though it typically leads to the same state
- Multiple actions can lead to the same state transitions
- If all actions uniquely map to state transitions, the model may be wrong
- Objects can transition from ending terminator states back to non-terminator states

State-transition model profiling involves:
- Terminator Profiling
- State-Transition Profiling
- Action Profiling
- State Duration Profiling
- State/Action Timeline Profiling
- Profiling Dependencies for State- and Action-Specific Attributes (discussed in the next Module)
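
To make state-transition profiling more concrete, the following sketch counts the transitions actually observed in the data and compares them with an assumed model. The states, the `ALLOWED` set, and the sample rows are invented for illustration. A real exercise would take them from the business model and the source tables.

```python
from collections import Counter

# Hypothetical model: allowed (from_state, to_state) pairs.
ALLOWED = {
    ("Applied", "Active"),
    ("Active", "Leave"),
    ("Leave", "Active"),
    ("Active", "Terminated"),
}

# Hypothetical event rows: (object_id, date, state), in no particular order.
rows = [
    (1, "2008-01-01", "Applied"),
    (1, "2008-02-01", "Active"),
    (1, "2009-03-01", "Terminated"),
    (2, "2008-05-01", "Applied"),
    (2, "2008-06-01", "Terminated"),
]

def profile_transitions(rows):
    """Count each observed state transition across all objects."""
    history = {}
    for obj_id, _date, state in sorted(rows):  # order by object, then ISO date
        history.setdefault(obj_id, []).append(state)
    counts = Counter()
    for states in history.values():
        counts.update(zip(states, states[1:]))  # consecutive state pairs
    return counts

observed = profile_transitions(rows)
unexpected = {t: n for t, n in observed.items() if t not in ALLOWED}
```

A transition that appears in `unexpected` is either a data quality problem or a sign that the model is incomplete, which is how profiling can also help reverse-engineer the model.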

Relational Data Models
Relational data models describe high-level logical data structure.
Data models are often not kept up-to-date with the actual data. Relational integrity profiling provides information about actual keys and relationships.

Identity Profiling
- The objective of identity profiling is to identify or verify identity keys
- Identity keys are not necessarily the same as primary keys
- When identity keys are unknown, candidate keys can be identified through profiling
- Some tools have this functionality
- Without tools, this requires running multiple GROUP BY statements
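
The GROUP BY approach can be sketched as follows: for each candidate key, count how many key values occur more than once. The example uses Python's `sqlite3` module. The table `person`, its columns, the sample rows, and the helper `duplicate_groups` are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (ssn TEXT, name TEXT, birth_date TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", [
    ("111", "Ann Lee", "1970-01-01"),
    ("222", "Bob Ray", "1980-05-05"),
    ("333", "Ann Lee", "1990-09-09"),
    ("444", "Bob Ray", "1980-05-05"),
])

def duplicate_groups(conn, table, columns):
    """Count key values that occur more than once for a candidate key.

    Table and column names are interpolated directly into the SQL text,
    so they must come from a trusted source such as the schema itself.
    """
    cols = ", ".join(columns)
    sql = (f"SELECT COUNT(*) FROM (SELECT {cols} FROM {table} "
           f"GROUP BY {cols} HAVING COUNT(*) > 1)")
    return conn.execute(sql).fetchone()[0]

candidates = [("ssn",), ("name",), ("name", "birth_date")]
results = {c: duplicate_groups(conn, "person", c) for c in candidates}
```

A candidate with zero duplicate groups is unique in the current data. Whether it is a true identity key is still a question for the business, since two rows with different `ssn` values may describe the same person.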

Subjects
- Subjects are the high-level business objects whose data are stored in the databases
- Subject profiling examines subjects across databases, as well as distribution of subject data within each database, in order to understand where the information about each subject can be found
- The first step in subject profiling is to uniquely identify all subjects
- When dealing with multiple sources, the next step is subject matching
- The most basic profiling technique is to identify all locations of data for each subject across company databases
- Additional subject profiling provides information about presence of subject data by entity
- In depth subject profiling provides information about the location of data for individual subjects
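
The most basic technique above, listing where each subject's data lives, might be sketched like this. It assumes subjects have already been identified and matched, so that the same id means the same subject in every source. The source names and ids are hypothetical.

```python
def subject_locations(sources):
    """Map each subject id to the sorted list of sources that hold data for it."""
    locations = {}
    for source, ids in sources.items():
        for sid in ids:
            locations.setdefault(sid, set()).add(source)
    return {sid: sorted(names) for sid, names in locations.items()}

# Hypothetical subject ids found in three company databases.
sources = {
    "hr": {"E1", "E2"},
    "payroll": {"E2", "E3"},
    "benefits": {"E2"},
}
locations = subject_locations(sources)
```

A subject such as `E3`, present in payroll but absent from HR, is exactly the kind of finding that leads to a question for the business.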

Relational model and relational integrity
- Relational models are often not kept up-to-date with the actual data
- Relational integrity profiling provides information about actual keys and relationships
- The objective of identity profiling is to identify or verify identity keys
- The objective of referential integrity profiling is to identify or verify foreign keys
- Cardinality profiling is used to understand true relationship cardinality
- Some specialized tools help identify candidate primary and foreign keys, especially useful when dealing with "legacy" data
- Many tools handle de-duplication procedures
- The remaining profiling techniques are easily done with SQL queries
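
As an example of the SQL involved, the sketch below checks a suspected foreign key for orphaned child rows and then profiles the actual relationship cardinality. It uses Python's `sqlite3` module. The tables `customer` and `orders` and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER);
    CREATE TABLE orders (order_id INTEGER, cust_id INTEGER);
    INSERT INTO customer VALUES (1), (2), (3);
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2), (13, 9);
""")

# Referential integrity profile: child rows whose parent key does not exist.
orphans = conn.execute("""
    SELECT o.order_id
    FROM orders o
    LEFT JOIN customer c ON c.cust_id = o.cust_id
    WHERE c.cust_id IS NULL
""").fetchall()

# Cardinality profile: how many parents have 0, 1, 2, ... children.
cardinality = conn.execute("""
    SELECT n, COUNT(*)
    FROM (
        SELECT c.cust_id, COUNT(o.order_id) AS n
        FROM customer c
        LEFT JOIN orders o ON o.cust_id = c.cust_id
        GROUP BY c.cust_id
    )
    GROUP BY n
    ORDER BY n
""").fetchall()
```

In this sample, one order points at a customer that does not exist, and the relationship turns out to be zero-to-many rather than one-to-many.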

Dependency profiling
Dependency profiling looks for hidden relationships between attribute values.
Dependency profiling is essential in:
- Dealing with "legacy" data
- Data consolidation projects
- Data quality assessment and data cleansing projects
- MDM initiative
- Analyzing data for state-dependent objects
The most common technique in dependency profiling is to look for affinity or disparity of values between attributes, which typically indicates a relationship.
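
One simple reading of "affinity of values" is: for each value of one attribute, how dominant is the most frequent value of the other? The sketch below computes that share. The function `affinity`, the attribute names, and the sample pairs are assumptions for illustration, not the course's own method.

```python
from collections import Counter, defaultdict

def affinity(pairs):
    """For each value of attribute A, report the dominant value of B and its share."""
    by_a = defaultdict(Counter)
    for a, b in pairs:
        by_a[a][b] += 1
    report = {}
    for a, counts in by_a.items():
        b, n = counts.most_common(1)[0]
        report[a] = (b, n / sum(counts.values()))
    return report

# Hypothetical (employment_status, pay_type) pairs.
pairs = [("FT", "Salary")] * 9 + [("FT", "Hourly")] + [("PT", "Hourly")] * 4
report = affinity(pairs)
```

A share close to 1.0 suggests a rule such as "part-time implies hourly", and the few records that break a strong affinity are natural candidates for a data quality check.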

Additional information about the online course
eLearningCurve presents our courses in Flash video format. It runs as a presentation in a web browser and allows the student to pause, resume, start over, and skip ahead in a very flexible manner. The presentation is compatible with any computer platform and browser with a Flash plug-in installed.
eLearningCurve courses target a broad range of IT and business audiences. The Additional Information box on each course's description page describes audience and prerequisite information for each course.
You may close out the presentation and return to it at a later time. The Course Server will track your progress so that you may resume where you left off, even if resuming from a different computer. Note, however, when taking an eLearningCurve exam, you may not pause or resume the module. Exams must be taken from beginning to end in one sitting.
Course enrollments are valid for 1 month. Exams are available for 6 weeks. Extensions of up to another 30 days may be granted upon request and will be granted at the sole discretion of eLearningCurve.

Conclusion
In the online course by eLearningCurve, for a price of just under 400 USD, the participant gets 30-day access to very valuable material. Data Profiling is an important part of the broad modern discipline of Data Quality Management. This fairly extensive review was meant to offer a glimpse into the knowledge base of Data Profiling.

doc. Arnošt Katolický
doc.aka(@)akamonitor.cz
29 September 2009
 

 
 
 
 
 
 
