Data - Tiny Word - BIG Impact

big_data_wordcloud.jpg

At The BA Zone we like to stress the importance of user-centered design. Today, I'd like to focus on data - a tiny word - that can have significant impact on all aspects of an organization, such as:  architecture, database and software design, reporting, marketing, sales, processing, compliance, and a host of other areas, all of which can lead to success or failure.

From a broader perspective, data is aggregated, analyzed, manipulated, and used for advancement across a wide spectrum of organizations and industries, such as:

  • Consumer-focused products
  • eCommerce
  • Healthcare
  • Science and Engineering
  • Law Enforcement
  • Education
  • Entertainment

From punch cards to self-evolving, artificial neural networks that are capable of learning based on new data inputs - data has moved from supporting business to driving business. As a BA, you need a clear, concise understanding of data concepts, structures, mapping, and analysis in order to:

  • Enable you to ask the right questions
  • Identify business results that are attainable or require process changes
  • Understand the impact of adding or removing data
  • Ensure data integrity and security
  • Create conceptual and logical data models
  • Migrate applications/databases obtained through business acquisitions
  • Empower users with the right information at the right time

Data Basics

There are numerous books and articles (in print and online) that cover data and database creation and management. Let's dip a toe in the water and start by reviewing some basic data concepts (just a few for now):

Primary Data vs. Derived Data

  • Primary Data - is data that is entered into a system manually or via file feeds.
  • Derived Data - is data that results from taking two or more primary data elements and applying some type of calculation, algorithm, logic, etc., the result of which is called a derived data element.

De-identified Data vs. Identified Data

  • De-identification of data is the process of protecting the identify of individuals and their personal information by removing or masking identifying data such as name, date of birth, social security number, etc. By using a unique identifier, data can later be re-identified. For example, healthcare organizations de-identify customer information based on the Health Insurance Portability and Accountability Act (HIPAA), designed to protect patient medical records by providing privacy compliance standards. 
  • Identified data is data where an individual's personal, identifying information is not masked or removed and, therefore, viewable to everyone with access to that information.

Batch Processing vs. Real-Time Processing

Two methods of handling, processing, and ingesting data into a database include: Batch Processing and Real-time/Near-time Processing. Many systems today employ both methods.

  • Batch Processing - takes datasets, file feeds, etc. and breaks them down into batch jobs that are scheduled for automatic processing and input into the database. These jobs often include operations that perform validation and modification of incoming data, formatting data, and the handling of bad data, just to name a few.
  • Real-time or Near-time processing - reflects data that is processed immediately in real-time or as near to real-time (e.g. near-time) as possible.

In general, data is collectively stored in structured tables (entities), columns (attributes) and rows (records) within databases and updated depending on multiple variables/operations. Some basic operations that can be performed on data include:  create, update, delete, read-only (sometimes called - reference), and query.

Types of Databases

There are different types of databases based on function. Here are a few of the most commonly used database types:

Operational/Transactional Databases

As its name implies, an operational/transactional database is the real-time, database-of-record that manages and stores data elements from the day-to-day operational processes and/or transactions of an organization. For example, that Mystery Science Theater 3000 (MST3K) video-on-demand you purchased from RiffTrax - what was it again? Oh right, The Crawling Eye, (you have to see it to believe it) would be considered a transaction. In general, these databases reflect current events/data elements along with control data (e.g. flags, counters, etc.) and are designed for fast retrieval/update of data and usually provide minimal reporting.

Data Warehouses and Data Marts

Data Warehouses are, typically, integrated with one or more operational/transactional databases to manage and store multiple versions of events and data elements, creating a historical view and audit trail. For example, detailed, multiple, hospital admissions for a patient over a period of time. Data Warehouses can support high volume, analytical processing, and reporting capabilities. Whereas, data warehouses generally store information at the enterprise level, Data Marts are subsets of data warehouses and provide targeted information. For example, Data Marts can provide client specific dashboard views of information.

Distributed Databases

Distributed databases are individual databases or portions of a database stored in multiple physical locations that are synchronized by a centralized, software system called a distributed database management system (DDBMS). This type of database might be used in a company that has multiple branches or offices, etc. Although physically separated, to users this type of database looks like a single database.

Database Management System (DBMS)

A database management system (DBMS) is software designed to enable users to manage data in a database. Typically managed by a database administrator (DBA), using a query language, the DBMS software interacts with the database to: insert, remove, modify, validate, and retrieve data. The software also provides protection of the database through secure access and recovery measures from user/hardware failures.

How organizations structure their databases is dependent on multiple variables. These variables can be revealed using data modeling techniques.

Data Modeling

Data_Modeling.jpg

A data model is a visual representation of an organization's data that allows database designers, developers, and end users to understand and agree on the organization and manipulation/integrity of data, the relationships between data and any constraints on the data. Data models can be grouped into three levels:

  • Conceptual Model (also known as a Data Model) - is a high-level representation of the type of information an organization needs, primarily focusing on entities (e.g. person, place, thing, process, event, etc., which you can collect data on) and their relationships and constraints.
  • Logical Model - takes the conceptual/data model a step further by adding as much detail as possible (e.g. attributes/details of entities and detailed relationships) and then layers in some of the technical aspects of implementation without detailing the physical database structure.
  • Physical Model - specifies database design, structure, tables, columns, rows, constraints, keys (primary/foreign), implementation, and physical storage, etc.

As a Business Analyst, you might be asked to elicit and provide the Conceptual and/or Logical models. As you create these models, focus on capturing a clear, detailed model, free of ambiguity that communicates everyone's understanding of the data.

Types of Database Models 

There are a variety of data models (also called, data structures and data schema) out there.  Here's a high-level look at some (not all) of these models.

Hierarchical Database Model

The Hierarchical database model organizes data into a tree-like, linear structure with a parent-child relationship where each child (record) has a single parent (table). This model states that each child (record) must have only one parent (table), but each parent (table) can have one to many children (records). In order for data to be retrieved from this type of model, it must begin at the parent (or root node).

Relational Database Model

The Relational database model, currently the most widely used database model, logically organizes data as independent tables with each table assigned a key field that connects it to data in one or more other tables. Retrieval of data from this type of model uses these keys - a Primary Key (PK) and a Foreign Key (FK) -  to create a relationship between the tables and quickly access data.

Network Database Model

The Network database model expands the Hierarchical Model from a one parent - many children construct to a model that employs records and sets, where a set consists of one parent record and one or more child records, allowing a record type to be a child of more than one set. This model allows retrieval of data by navigating through these individual set instances.

Object-Oriented Database Model

The Object-Oriented database model (also called Object Database Management Systems - ODBMS) are databases that store their contents as objects. Objects contain data and executable code in the form of Attributes (data that defines the characteristics of an object) and Methods (the behavior of an object). Classes are used as a template to define the data and methods within the object. This type of model can be used to address complex data/complex data relationships.

Entity-Relationship Database Model

In the Entity-Relationship database model (also referred to as an Entity-Relationship Diagram (ERD), data is defined as Entities (people, places, things, processes) and Attributes (characteristics of an entity), which together make up their domain. The relationships (cardinality) between entities states how many rows in one entity will match rows in another entity (e.g. one-to-one, one-to-many, many-to-many, etc.).

Document Database Model

A Document database model (also referred to as a document store or document-oriented database) is a software program that stores, retrieves and manages document-oriented information (e.g. Microsoft Word, PDF, XML, etc.) that is semi-structured in nature. Instead of employing a table format (rows and columns, etc.), each document can have the same or different structure. To support this design, documents are grouped into Collections, which can then be queried (searched) for documents with certain attributes.

Entity-Attribute-Value Database Model

An Entity-Attribute-Value (EAV) database model (also referred to as an Object-Attribute-Value Model, vertical database model and open schema), is one where the Attribute and Value pair to describe one attribute of a given entity. Often used where the number of attributes could be limitless, but the number that will actually apply to a given entity is relatively modest. For example, a supermarket carries a limitless amount of products (that are being added, changed, updated, etc.). Each product (e.g. entity) might specify, size, weight, unit price, etc. (e.g. attributes) and each attribute may contain a price (e.g. value). From the limitless amount of potential products available, a customer may only purchase a modest bag of groceries.

Star Schema Database Model

In a Star Schema database model, data is organized into two categories:  Facts (events) surrounded by Dimensions (reference information about the Facts) that when diagrammed resemble the shape of a star. The relationships between Fact and Dimension tables are handled using keys. Each Dimension table is assigned a unique primary key. Each Fact table contains the unique primary key of each associated Dimension table and stores it as a foreign key. Considered the simplest database model, Star Schemas are commonly used in data marts and data warehouses.

Data - Byte By Byte

If you were not familiar with the data side of business analysis, you now have a lot to digest! Consider the information covered in this blog your appetizer portion. For the novice, data is best served one byte at a time. Bon Appetite!