5 V’s of Big Data

Big data: Data that arrives rapidly from various sources, is massive in size, and is difficult for organizations to secure, analyze, and gain valuable insight from.

Big data challenges

Organizations encounter difficulties with the large amounts, rapid velocity, and diverse kinds of data being created and gathered. This burdens data storage systems and impacts the usefulness and insight that can be gained.

Big data challenges cannot be solved with traditional database and processing solutions. You need a good understanding of the big data characteristics: volume, variety, velocity, veracity, and value. You can then begin to find the right AWS data and analytics solutions for the 5 V challenges of big data.

Volume:

The size of data stored by an organization can range from terabytes to petabytes, and sometimes even exabytes. This makes it difficult for traditional storage systems to manage efficiently. Every organization’s infrastructure must support large amounts of scalable, durable data storage and must be able to gather this data from many different sources. Every company with data needs vast amounts of computing and storage capacity.

Global data creation is projected to grow to more than 180 zettabytes by 2025.

  • Transactional data includes the following:
    • Customer information
    • Online product purchases
    • Service contracts
  • Temporary data includes the following:
    • Moves you make in an online video game
    • Internet browser cache
  • Objects include the following:
    • Images
    • Email messages
    • Text files
    • Social media content
    • Text messages
    • Videos

Variety:

Analytics begins with data sources. A data source can be a folder on a file server, a database, a web page, or even a wearable device. Data sources can come from within an organization or from outside of it. Data variety refers to how a data source organizes different types of data.

Structured, semistructured, and unstructured data each have unique characteristics that form a recognizable pattern.

  • Structured data (stored in tables) includes the following:
    • Customer Relationship Management (CRM) system
    • Online forms
    • Network logs
    • Event reservation system
  • Semistructured data (stored as documents or key-value pairs) includes the following formats (a short sketch follows this list):
    • CSV
    • JSON
    • XML
  • Unstructured data (stored as files) includes the following:
    • Clickstream data
    • Emails
    • Documents
    • PDFs
    • Photos
    • Videos
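
To make the semistructured formats above concrete, here is a minimal Python sketch, assuming a hypothetical customer record, that writes the same data as JSON and as CSV. In both cases the field names travel with or alongside the values, which is what distinguishes semistructured data from a fixed relational schema.

```python
import csv
import io
import json

# A single hypothetical customer record.
record = {"customer_id": "C-1001", "name": "Ana", "country": "BR"}

# JSON keeps the field names next to each value inside the document itself.
print(json.dumps(record))

# CSV keeps the field names in a header line, followed by delimited rows.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record.keys()))
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```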

Data Storage Methods

You can classify data source types based on the storage types that are used. Structured and semistructured data are stored in database management systems (DBMS). Unstructured data is stored in data lakes or object storage solutions.

Structured Data Stores:

Structured data is stored within a relational database management system (RDBMS). A database is a structured set of data stored in a computer. An RDBMS provides structure to the data, lifecycle maintenance, and management of interactions with other processes and systems.

Relational databases store data in tables that are related to one another. Within each table, there are columns and rows, which are defined by a schema. The schema defines the structure of the data and the relationships between tables. The goal of relational databases is optimized storage. To query the data, relational databases use Structured Query Language (SQL). There are many types of databases to choose from. To choose the right type of database, you must understand the data you have and how you will use it.
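
As a minimal sketch of these ideas, the following Python example uses an in-memory SQLite database; the table and column names are hypothetical. The CREATE TABLE statement is the schema, and SQL is used to query it.

```python
import sqlite3

# Create an in-memory relational database (no server or file needed).
conn = sqlite3.connect(":memory:")

# The schema: a hypothetical orders table with typed columns and a primary key.
conn.execute(
    """CREATE TABLE orders (
           order_id  INTEGER PRIMARY KEY,
           customer  TEXT NOT NULL,
           total_usd REAL
       )"""
)
conn.execute("INSERT INTO orders VALUES (1, 'Ana', 19.99)")
conn.execute("INSERT INTO orders VALUES (2, 'Ben', 42.50)")

# SQL query: filter rows using the columns defined by the schema.
for customer, total in conn.execute(
    "SELECT customer, total_usd FROM orders WHERE total_usd > 20"
):
    print(customer, total)  # Ben 42.5

conn.close()
```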

Relational databases have the following strengths and weaknesses.

  • Strengths of a relational database include the following:
    • Proven technology, widely adopted and understood
    • Less risk involved
    • Very good transactional latency
    • Outstanding for relatively small data sets
  • Weaknesses of a relational database include the following:
    • Scalability concerns
    • Need to increase hardware capacities as data grows
    • Fixed schema makes it difficult to make non-disruptive changes
    • Struggles with storing unstructured data

Semistructured Data Stores:

Semistructured data is often stored in non-relational database systems, sometimes called NoSQL databases. This term can be confusing. Non-relational or NoSQL does not mean the data cannot be queried using SQL. A better way to think of it is "not only SQL."

NoSQL database storage methods

Non-relational databases are built to store semistructured data for rapid collection and retrieval. NoSQL databases store data as a collection of documents or key-value pairs.

There are several broad categories of non-relational databases, each built to meet specific requirements. NoSQL databases are denormalized, meaning they can contain duplicate data, and they can use an internal document structure or no structure at all.

  • Key-value pairs – Key-value databases are a type of non-relational database that store unstructured data in the form of key-value pairs. Logically, data is stored in a single table. Within the table, the values are associated with a specific key. The values are stored in the form of blob objects and do not require a predefined schema. The values can be of nearly any type. Strengths include the following:
    • Very flexible
    • Able to handle a wide variety of data types
    • Keys are linked directly to their values with no need for indexing or complex join operations
    • Content of a key can be conveniently copied to other systems without reprogramming the data
    Weaknesses include the following:
    • Difficult to query values because they are stored as a single blob
    • Updating or editing the content of a value is quite difficult
    • Not all objects are conveniently modeled as key-value pairs
  • Document stores – Document stores are a type of non-relational database that store semistructured and unstructured data in the form of files. These files range in form, but include JSON, BSON, and XML. The files can be navigated using numerous languages, including Python and Node.js. Logically, files contain data stored as a series of elements. Each element is an instance of a person, place, thing, or event. For instance, a document store might hold a series of log files from a set of servers. These log files can each contain the specifics for that system without concern for what the log files in other systems contain. A sketch of both storage styles follows this list. Strengths include the following:
    • Flexibility
    • No need to plan for a specific type of data when creating a document store
    • Convenient to scale
    Weaknesses include the following:
    • Cannot query across files
NoSQL databases have multiple languages that can be used to query their data, and they focus on object querying. NoSQL databases place the emphasis on compute power. In terms of scalability, NoSQL databases are good at being distributed across multiple servers or instances. NoSQL databases are commonly used for OLTP web and mobile applications.

Non-relational databases have the following strengths and weaknesses.

  • Strengths of NoSQL databases include the following:
    • Ability to update schemas on the fly
    • Faster development cycles
    • Less downtime
    • Scales really well for larger data sets
  • Weaknesses of NoSQL databases include the following:
    • Data is not instantaneously updated with every change.
    • It does not perform well for applications requiring extremely low transactional latency.
    • It is not as mature as relational database technology.

OLAP and OLTP Systems

Within databases, there are two primary methods for organizing information: online transaction processing (OLTP) and online analytical processing (OLAP).

Two different systems are needed because the resources supporting a database are used in two different ways: to store data and to retrieve data.

Adding data to a database is called a write operation. Querying data in a database is called a read operation. These two operations use the same resources, but in different ways. This makes it difficult to optimize for one or the other.

Smaller databases have a tolerance for simultaneous write and read operations. However, in large databases, you are forced to sacrifice the performance of read operations to allow high-performance write operations—or sacrifice performance of write operations to allow for high-performance read operations.

The solution is to have an OLTP database optimized for write operations and an OLAP database optimized for read operations.

OLTP Systems:

Transactional databases are called OLTP databases. In an OLTP system, the most common queries are called lookup queries. OLTP is associated more with queries that return entire rows. The filters on this data are generally based on the key columns in that table. In this type of system, you might query to get details for a specific order.

OLAP Systems:

In an OLAP system, the most common queries are aggregate queries. These queries take large numbers of rows and reduce them to a single row by aggregating the values in one or more columns. OLAP is associated more with queries that only need a few columns. In this type of system, you might query to find out the total number of items sold on a specific date.
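
As a minimal sketch of the two query styles, assuming a hypothetical sales table in an in-memory SQLite database, the first query is an OLTP-style lookup that returns one full row by key, and the second is an OLAP-style aggregation that reduces many rows to a single summary value.

```python
import sqlite3

# Hypothetical sales table used to contrast the two query styles.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER, sale_date TEXT, item TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "2024-05-01", "widget", 3),
        (2, "2024-05-01", "gadget", 1),
        (3, "2024-05-02", "widget", 5),
    ],
)

# OLTP-style lookup query: return the entire row for one specific order.
print(conn.execute("SELECT * FROM sales WHERE order_id = 2").fetchone())

# OLAP-style aggregate query: total items sold on a specific date.
print(
    conn.execute(
        "SELECT SUM(quantity) FROM sales WHERE sale_date = '2024-05-01'"
    ).fetchone()
)

conn.close()
```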

Row-Based and Columnar Data Storage:

Data within a database should be indexed to allow a query to quickly find the data it needs to produce a result. Indexes control the way data is physically stored on disk. They physically group records into a predictable order based on the key values within the table. This plays a huge part in the speed and efficiency of queries.

Both OLTP and OLAP systems can use either indexing method. However, there are advantages to choosing the method that is best suited to the types of queries that will be run the majority of the time.

The following table breaks down some of the characteristics of each storage type.

| Characteristic   | Row-based                                  | Columnar                                |
|------------------|--------------------------------------------|-----------------------------------------|
| Storage on disk  | Row by row                                 | Column by column                        |
| Read/write       | Best at random reads and writes            | Best at sequential reads and writes     |
| Best for         | Returning full rows of data based on a key | Returning aggregations of column values |
| Implementation   | Transactional systems                      | Analytical processing                   |
| Data compression | Low to medium compression can be achieved  | High compression is the norm            |
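
A minimal in-memory sketch of the two layouts, with hypothetical order data: the row-based layout keeps each record together, which suits returning a full row by key, while the columnar layout keeps each column together, which suits aggregating one column across many records and compresses well.

```python
# Row-based layout: each record is stored together.
rows = [
    {"order_id": 1, "item": "widget", "quantity": 3},
    {"order_id": 2, "item": "gadget", "quantity": 1},
    {"order_id": 3, "item": "widget", "quantity": 5},
]
# Fetching one full row by key touches a single, contiguous record.
order_2 = next(r for r in rows if r["order_id"] == 2)

# Columnar layout: each column is stored together.
columns = {
    "order_id": [1, 2, 3],
    "item": ["widget", "gadget", "widget"],
    "quantity": [3, 1, 5],
}
# Aggregating one column scans only that column, not every full record.
total_quantity = sum(columns["quantity"])

print(order_2, total_quantity)
```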

In practice, data is written to the OLTP database with a very high frequency. Records from that system are copied over to an OLAP system on a scheduled basis. In many solutions, the data being copied to the OLAP database is also transformed to provide clearer answers for analytic questions or pre-aggregated to enable rapid query results. This is all done with extract, transform, and load (ETL) operations.

Comparison of Data Storage Methods:

There are many factors that can help you determine which database type is best for a new project or program you are developing. The following table summarizes the key characteristics of the two database types we have covered: relational (SQL) and non-relational (NoSQL).

| Characteristic | Relational (SQL)                                     | Non-relational (NoSQL)                                           |
|----------------|------------------------------------------------------|------------------------------------------------------------------|
| Representation | Multiple tables, each containing columns and rows    | Collection of documents, or a single table with keys and values  |
| Data design    | Normalized relational or dimensional data warehouse  | Denormalized document, wide column, or key-value                 |
| Optimized for  | Storage                                              | Compute                                                          |
| Query style    | Language: SQL                                        | Language: many; uses object querying                             |
| Scaling        | Scale vertically                                     | Scale horizontally                                               |
| Implementation | OLTP business systems                                | OLTP web and mobile applications                                 |

Purpose-Built Data Stores

The following table lists AWS services that provide purpose-built data stores.

| Database | Description |
|----------|-------------|
| Amazon Aurora | High-performance, high-availability, scalable, proprietary serverless relational database management system (RDBMS) with full MySQL and PostgreSQL compatibility |
| Amazon Relational Database Service (Amazon RDS) | Managed relational database service in the cloud with various database engine options |
| Amazon Redshift | Cloud-based data warehousing with ML to deliver the best price performance at any scale |
| Amazon DynamoDB | Fast, flexible, and highly scalable NoSQL database |
| Amazon ElastiCache | Fully managed, cost-optimized, highly scalable data caching service for real-time performance |
| Amazon MemoryDB for Redis | Redis-compatible, durable, in-memory database for ultra-fast performance |
| Amazon DocumentDB (with MongoDB compatibility) | Fully managed, scalable JSON document database |
| Amazon Keyspaces (for Apache Cassandra) | Scalable, highly available, serverless, managed Apache Cassandra-compatible database service |
| Amazon Neptune | High-availability, scalable, serverless graph database |
| Amazon Timestream | Fast, scalable, and serverless time-series database |
| Amazon Quantum Ledger Database (QLDB) | Fully managed, cryptographically verifiable ledger database |
| AWS Database Migration Service (AWS DMS) | Automated managed migration and replication service to move database and analytics workloads to AWS with minimal downtime and zero data loss |

Velocity

There are four velocities for processing data.

  • Scheduled – Scheduled batch processing represents data that is processed in a very large volume on a regularly scheduled basis, for instance, once a week or once a day. It is generally the same amount of data with each load, making these workloads predictable.
  • Periodic – Periodic batch processing is a batch of data that is processed at irregular times. These workloads are often run after a certain amount of data has been collected. This can make them unpredictable and hard to plan around.
  • Near real-time – Near real-time processing represents streaming data that is processed in small individual batches. The batches are continuously collected and then processed within minutes of the data generation (see the micro-batch sketch after this list).
  • Real-time – Real-time processing represents streaming data that is processed in very small individual batches. The batches are continuously collected and then processed within milliseconds of the data generation.
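
The following is a minimal sketch of near-real-time (micro-batch) processing, assuming events arrive from a stream; the `micro_batches` helper and the window length are hypothetical. Events are collected continuously and handed off in small batches shortly after they are generated.

```python
import time
from typing import Iterable, Iterator, List

def micro_batches(events: Iterable[dict], window_seconds: float) -> Iterator[List[dict]]:
    """Group a continuous stream of events into small, time-windowed batches."""
    batch: List[dict] = []
    window_end = time.monotonic() + window_seconds
    for event in events:
        batch.append(event)
        if time.monotonic() >= window_end:
            yield batch  # hand the small batch off for processing
            batch = []
            window_end = time.monotonic() + window_seconds
    if batch:
        yield batch  # flush whatever remains when the stream ends

# Hypothetical usage: in practice the events would come from a streaming source,
# not an in-memory generator.
stream = ({"click_id": i} for i in range(100))
for batch in micro_batches(stream, window_seconds=0.01):
    print(f"processed a batch of {len(batch)} events")
```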

Veracity:

Data integrity and preventing potential issues

Data changes over time. As it is transferred from one process to another, and through one system to another, there are opportunities for the integrity of the data to be negatively impacted.

The importance of Data Integrity

You must maintain a high level of certainty that the data you are analyzing is trustworthy. Data veracity is contingent on the integrity of the data: the entire data chain must be secure and free from compromise. Understanding the full lifecycle of your data and knowing how to protect it effectively will greatly strengthen its integrity.

Identifying Data Integrity issues

A data analyst might be called to perform data integrity checks. During this process they look for potential sources of data integrity problems.

Data can come from both internal and external sources. Analysts are unlikely to be able to influence data generated outside of the organization. Within the organization, however, they might be able to recommend improvements for the data sources they will be interacting with.

Data analysts may need to determine the integrity of the data sources and make adjustments to account for any integrity deficiencies.

Transforming Data with the ETL Process

Extract, transform, and load (ETL) is the process of collecting data from raw data sources and transforming that data into a common type. This new data is loaded into a final location to be available for analysis and inspection.

The ETL approach uses a set of business rules to process data from several sources before centralized integration.

The purpose of the ETL process

The purpose of the ETL process is as follows:

  • To ensure the data has the required accuracy, precision, and depth
  • To bring together data from different sources to gain a complete picture
  • To build purpose-built data sets to answer key business questions

What happens in each step is described in the following three categories.

  • Extract data – The extraction phase of this process is arguably the most important of all phases. The data required for most analytics transformations will likely come from multiple locations and be of multiple types, such as transaction logs, product databases, public data sources, or application streams. There are four key areas you must plan for:
    1. You must identify where all of the source data resides. This may be data stored on-premises by your company, but can also include data that is found online.
    2. You must carefully plan when the extraction will take place due to the potential impact of the copy process on the source system.
    3. You must plan for where the data will be stored during processing. This is generally referred to as a staging location.
    4. You must plan for how often the extraction must be repeated.
    After you have determined where your data is coming from and what you want, you will extract that information and place it into a staging location.
  • Transform data – Transforming your data into a uniform, queryable format is the heart of the ETL process (a sketch of this step follows the list). This phase involves using a series of rules and algorithms to massage the data into its final form. Data cleansing also occurs during this part of the process. Transformations can be basic, such as cleaning data to update formats or to perform data substitutions. This could be replacing NULL values with a zero or replacing the word female with the letter F. These seemingly small changes can have a huge impact on the usefulness of this data to analysts later, in the visualization process. Transformations can also be more advanced, including applying business rules to the data to calculate new values. Filtering, complex join operations, aggregating rows, splitting columns, and data validation are all very common types of transformations applied at this phase.
  • Load data – The final phase of the ETL process is where you store the newly transformed data. The planning steps you took in the extraction phase will dictate the form the final data store must take. This could be a database, data warehouse, or data lake. After the process has successfully completed, the data in this location is ready to be analyzed.
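
Below is a minimal sketch of the transform step described above, assuming hypothetical raw records pulled into a staging area; it performs the kinds of basic substitutions mentioned in the list, replacing NULL-like values with zero and standardizing a text value.

```python
# Hypothetical raw records extracted into a staging location.
raw_records = [
    {"customer": "Ana", "gender": "female", "orders": None},
    {"customer": "Ben", "gender": "M", "orders": 4},
]

def transform(record: dict) -> dict:
    """Apply basic cleansing rules to one record."""
    cleaned = dict(record)
    # Substitute NULL values with a zero.
    if cleaned["orders"] is None:
        cleaned["orders"] = 0
    # Standardize formats, for example replacing the word "female" with "F".
    if cleaned["gender"].lower() == "female":
        cleaned["gender"] = "F"
    return cleaned

transformed = [transform(r) for r in raw_records]
print(transformed)
# The load phase would then write `transformed` to its final data store.
```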


ELT process steps

In modern cloud-based environments, the extract, load, and transform (ELT) approach loads data as it is. It then transforms it at a later stage, depending on the use case and analytics requirements.

The steps are similar to those in the ETL process, performed in a different order, but with similar results.

The ELT process requires more definition at the beginning. Analytics teams must be involved from the start to define target data types, structures, and relationships.

These are the three steps of ELT:

  1. Extract raw data from various sources.
  2. Load it in its natural state into a data warehouse or data lake.
  3. Transform it as needed while in the target system.

With ELT, all data cleansing, transformation, and enrichment occur within the data warehouse. You can interact with and transform the raw data as many times as needed.
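
As a minimal sketch of ELT, assuming an in-memory SQLite database standing in for the data warehouse and hypothetical raw events, the data is loaded as-is, and the cleansing and aggregation are expressed later as SQL inside the target system.

```python
import sqlite3

# The "data warehouse": an in-memory SQLite database used here as a stand-in.
warehouse = sqlite3.connect(":memory:")

# Load: raw data goes in exactly as extracted, with no up-front reshaping.
warehouse.execute("CREATE TABLE raw_events (user TEXT, amount TEXT)")
warehouse.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("ana", "19.99"), ("ben", None), ("ana", "5.00")],
)

# Transform: performed inside the warehouse, on demand, with SQL.
query = """
    SELECT user, SUM(CAST(COALESCE(amount, '0') AS REAL)) AS total
    FROM raw_events
    GROUP BY user
    ORDER BY user
"""
print(warehouse.execute(query).fetchall())  # one cleaned total per user

warehouse.close()
```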

A comparison of the ETL and ELT processes

Data scientists mainly use ETL to load legacy databases into the data warehouse, and they use ELT with modern databases.

Value:

Query and Reporting:

We have covered data collection and the speed of collection, data storage, capacity, processing, and analytics. All of these have one thing in common: they do not evaluate the data for usefulness. That is where value comes in. Before making decisions, it is important to extract value from your data. You can extract, filter, and customize your data by creating queries. Value is derived from data by querying it and generating meaningful reports.

Analytical reporting is used to transform data into actionable information that empowers organizations to make informed decisions, optimize processes, and achieve strategic objectives. Building a solid report that will provide consumers with what they need to make critical decisions is a bit of an art form. There are a few steps to be successful:

  1. Gather the data, facts, action items, and conclusions.
  2. Identify the audience, expectations they have, and the proper method of delivery.
  3. Identify the visualization styles and report style that will best fit the needs of the audience.
  4. Create the reports and dashboards.

Querying and reporting provide decision-makers with relevant and timely information. They bridge the gap between raw data and actionable insights, so businesses can use their data assets effectively and gain a competitive advantage.

Visualizing Data

Analytics exists to help you get the most value and useful insights from raw data. With reporting tools, you can create visual representations of data, such as charts, graphs, and dashboards. Visualization makes complex data more accessible and understandable, helping users quickly identify trends, patterns, and anomalies.
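
As a minimal sketch of a visual report, assuming matplotlib is available and using hypothetical daily sales figures, the following builds a simple bar chart that could be placed on a dashboard.

```python
import matplotlib.pyplot as plt

# Hypothetical query results: items sold per day.
dates = ["2024-05-01", "2024-05-02", "2024-05-03"]
items_sold = [120, 95, 143]

# Build a simple bar chart: a quick way to spot trends and anomalies.
fig, ax = plt.subplots()
ax.bar(dates, items_sold)
ax.set_title("Items sold per day")
ax.set_xlabel("Date")
ax.set_ylabel("Items sold")
fig.savefig("items_sold.png")  # save the chart, or embed it in a dashboard
```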

With well-built visual data, you can describe the data, why it is important, and how to move forward with the information provided. To build an effective visual, identify the types of analytics that meet the needs of the analytics request.

Reporting comes in many shapes and sizes. Organize your reports to meet the needs of the consumers of the reports. When creating reports and dashboards, use charts, tables, and graphs to answer questions. The clearer the questions, the better the answers the report or dashboard will provide.

You can break reports and dashboards into pages or views. These pages should have a single theme for all the report elements within them. Provide filters that the report consumer can apply to either the whole page or to individual elements within the page.

There are three broad types of visual reports: static, interactive, and dashboards.
