Figure 1.1 What’s driving the data deluge
The rate of data creation is accelerating, driven by many of the items in Figure 1.1.
Social media and genetic sequencing are among the fastest-growing sources of Big Data
and examples of untraditional sources of data being used for analysis.
For example, in 2012 Facebook users posted 700 status updates per second worldwide,
which can be leveraged to deduce latent interests or political views of users and show
relevant ads. For instance, an update in which a woman changes her relationship status
from “single” to “engaged” would trigger ads on bridal dresses, wedding planning, or
name-changing services.
Facebook can also construct social graphs to analyze which users are connected to each
other as an interconnected network. In March 2013, Facebook released a new feature
called “Graph Search,” enabling users and developers to search social graphs for people
with similar interests, hobbies, and shared locations.
Another example comes from genomics. Genetic sequencing and human genome mapping
provide a detailed understanding of genetic makeup and lineage. The health care industry
is looking toward these advances to help predict which illnesses a person is likely to get in
his lifetime and take steps to avoid these maladies or reduce their impact through the use
of personalized medicine and treatment. Such tests also highlight typical responses to
different medications and pharmaceutical drugs, heightening risk awareness of specific
drug treatments.
While data has grown, the cost to perform this work has fallen dramatically. The cost to
sequence one human genome has fallen from $100 million in 2001 to $10,000 in 2011,
and the cost continues to drop. Now, websites such as 23andme (Figure 1.2) offer
genotyping for less than $100. Although genotyping analyzes only a fraction of a genome
and does not provide as much granularity as genetic sequencing, it does point to the fact
that data and complex analysis are becoming more prevalent and less expensive to deploy.
Figure 1.2 Examples of what can be learned through genotyping, from 23andme.com
As illustrated by the examples of social media and genetic sequencing, individuals and
organizations both derive benefits from analysis of ever-larger and more complex datasets
that require increasingly powerful analytical capabilities.
1.1.1 Data Structures
Big Data can come in multiple forms, including structured and non-structured data such as
financial data, text files, multimedia files, and genetic mappings. Contrary to much of the
traditional data analysis performed by organizations, most Big Data is unstructured
or semi-structured in nature, which requires different techniques and tools to process and
analyze. [2] Distributed computing environments and massively parallel processing (MPP)
architectures that enable parallelized data ingest and analysis are the preferred approach to
process such complex data.
With this in mind, this section takes a closer look at data structures.
Figure 1.3 shows four types of data structures, with 80–90% of future data growth coming
from non-structured data types. [2] Though different, the four are commonly mixed. For
example, a classic Relational Database Management System (RDBMS) may store call
logs for a software support call center. The RDBMS may store characteristics of the
support calls as typical structured data, with attributes such as time stamps, machine type,
problem type, and operating system. In addition, the system will likely have unstructured,
quasi- or semi-structured data, such as free-form call log information taken from an e-mail
ticket of the problem, customer chat history, or transcript of a phone call describing the
technical problem and the solution or audio file of the phone call conversation. Many
insights could be extracted from the unstructured, quasi- or semi-structured data in the call
center data.
Figure 1.3 Big Data Growth is increasingly unstructured
Although analyzing structured data tends to be the most familiar technique, different
techniques are required to meet the challenges of analyzing semi-structured data (shown as
XML), quasi-structured data (shown as a clickstream), and unstructured data.
Here are examples of how each of the four main types of data structures may look.
Structured data: Data containing a defined data type, format, and structure (for example,
transaction data, online analytical processing [OLAP] data cubes, traditional
RDBMS, CSV files, and even simple spreadsheets). See Figure 1.4.
Semi-structured data: Textual data files with a discernible pattern that enables
parsing (such as Extensible Markup Language [XML] data files that are
self-describing and defined by an XML schema). See Figure 1.5.
Quasi-structured data: Textual data with erratic data formats that can be formatted
with effort, tools, and time (for instance, web clickstream data that may contain
inconsistencies in data values and formats). See Figure 1.6.
Unstructured data: Data that has no inherent structure, which may include text
documents, PDFs, images, and video. See Figure 1.7.
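As a rough illustration of the four categories above, the following Python sketch shows how much work each one takes to read. All sample records, field names, and log lines are hypothetical and are not taken from the figures; the regular expression is only one possible way to tame erratic log formats.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: fixed columns, so a standard reader is enough (hypothetical records).
CSV_TEXT = "call_id,machine_type,os\n1,laptop,Linux\n2,server,Windows\n"
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Semi-structured: self-describing tags give the parser a discernible pattern.
XML_TEXT = (
    "<calls><call id='1'><os>Linux</os></call>"
    "<call id='2'><os>Windows</os></call></calls>"
)
root = ET.fromstring(XML_TEXT)
xml_os = [call.findtext("os") for call in root.iter("call")]

# Quasi-structured: erratic formats that need extra effort (here, a regex).
LOG_LINES = [
    "2013-03-01 10:02 GET /guest/campaign/data_science.aspx",
    "03/01/2013 10:05am  get /guest/certification/framework/stf/data_science.aspx ",
]
path_pattern = re.compile(r"\bget\s+(\S+)", re.IGNORECASE)
paths = [m.group(1) for line in LOG_LINES if (m := path_pattern.search(line))]

# Unstructured: no inherent structure; only crude features come for free.
NOTE = "Customer reports the laptop will not boot after the update."
word_count = len(NOTE.split())
```

The structured and semi-structured inputs are handled by standard parsers, while the quasi-structured lines need a hand-written pattern and the free text yields little without further processing.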
Figure 1.4 Example of structured data
Figure 1.5 Example of semi-structured data
Figure 1.6 Example of EMC Data Science search results
Figure 1.7 Example of unstructured data: video about Antarctica expedition [3]
Quasi-structured data is a common phenomenon that bears closer scrutiny. Consider the
following example. A user attends the EMC World conference and subsequently runs a
Google search online to find information related to EMC and Data Science. This would
produce a URL such as https://www.google.com/#q=EMC+data+science and a list of
results, such as in the first graphic of Figure 1.6.
After doing this search, the user may choose the second link, to read more about the
headline “Data Scientist—EMC Education, Training, and Certification.” This brings the
user to an emc.com site focused on this topic and a new URL,
https://education.emc.com/guest/campaign/data_science.aspx, that displays the page
shown as (2) in Figure 1.6. Arriving at this site, the user may decide to click to learn
more about the process of becoming certified in data science. The user chooses a link
toward the top of the page on Certifications, bringing the user to a new URL:
https://education.emc.com/guest/certification/framework/stf/data_science.aspx,
which is (3) in Figure 1.6.
Visiting these three websites adds three URLs to the log files monitoring the user’s
computer or network use. These three URLs are:
https://www.google.com/#q=EMC+data+science
https://education.emc.com/guest/campaign/data_science.aspx
https://education.emc.com/guest/certification/framework/stf/data_science.aspx
This set of three URLs reflects the websites and actions taken to find Data Science
information related to EMC. Together, they form a clickstream that can be parsed and
mined by data scientists to discover usage patterns and uncover relationships among clicks
and areas of interest on a website or group of sites.
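A minimal sketch of such parsing, using Python's standard urllib.parse module on the three URLs above, might look like the following. The function name parse_click and the output fields are illustrative assumptions, and real log files would need far more cleaning than this.

```python
from urllib.parse import parse_qs, urlsplit

CLICKSTREAM = [
    "https://www.google.com/#q=EMC+data+science",
    "https://education.emc.com/guest/campaign/data_science.aspx",
    "https://education.emc.com/guest/certification/framework/stf/data_science.aspx",
]


def parse_click(url):
    """Split one logged URL into host, path, and any search terms (hypothetical helper)."""
    parts = urlsplit(url)
    # In this example the search terms sit in the fragment (after '#'),
    # so the fragment is parsed as if it were a query string.
    terms = parse_qs(parts.fragment).get("q", [""])[0]
    return {"host": parts.netloc, "path": parts.path, "search_terms": terms}


clicks = [parse_click(url) for url in CLICKSTREAM]

# Pair each click with the next one to see how the user moved between sites.
hosts = [click["host"] for click in clicks]
transitions = list(zip(hosts, hosts[1:]))
```

Here transitions records one move from the search engine to the education site and one move within that site, which is the kind of relationship among clicks the text describes.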
The four data types described in this chapter are sometimes generalized into two groups:
structured and unstructured data. Big Data describes new kinds of data with which most
organizations may not be used to working. With this in mind, the next section discusses
common technology architectures from the standpoint of someone wanting to analyze Big
Data.
1.1.2 Analyst Perspective on Data Repositories
The introduction of spreadsheets enabled business users to create simple logic on data
structured in rows and columns and create their own analyses of business problems.
Database administrator training is not required to create spreadsheets: They can be set up
to do many things quickly and independently of information technology (IT) groups.
Spreadsheets are easy to share, and end users have control over the logic involved.
However, their proliferation can result in “many versions of the truth.” In other words, it
can be challenging to determine if a particular user has the most relevant version of a
spreadsheet, with the most current data and logic in it. Moreover, if a laptop is lost or a file
becomes corrupted, the data and logic within the spreadsheet could be lost. This is an
ongoing challenge because spreadsheet programs such as Microsoft Excel still run on
many computers worldwide. With the proliferation of data islands (or spreadmarts), the
need to centralize the data is more pressing than ever.
As data needs grew, more scalable data warehousing solutions emerged. These technologies
enabled data to be managed centrally, providing benefits of security, failover, and a single
repository where users could rely on getting an “official” source of data for financial
reporting or other mission-critical tasks. This structure also enabled the creation of OLAP
cubes and BI analytical tools, which provided quick access to a set of dimensions within
an RDBMS. More advanced features enabled performance of in-depth analytical
techniques such as regressions and neural networks. Enterprise Data Warehouses (EDWs)
are critical for reporting and BI tasks and solve many of the problems that proliferating
spreadsheets introduce, such as which of multiple versions of a spreadsheet is correct.
EDWs—and a good BI strategy—provide direct data feeds from sources that are centrally
managed, backed up, and secured.
Despite the benefits of EDWs and BI, these systems tend to restrict the flexibility needed
to perform robust or exploratory data analysis. With the EDW model, data is managed and
controlled by IT groups and database administrators (DBAs), and data analysts must
depend on IT for access and changes to the data schemas. This imposes longer lead times
for analysts to get data; most of the time is spent waiting for approvals rather than starting
meaningful work. Additionally, many times the EDW rules restrict analysts from building
datasets. Consequently, it is common for additional systems to emerge containing critical
data for constructing analytic datasets, managed locally by power users. IT groups
generally dislike the existence of data sources outside of their control because, unlike an
EDW, these datasets are not managed, secured, or backed up. From an analyst perspective,
EDW and BI solve problems related to data accuracy and availability. However, EDW and
BI introduce new problems related to flexibility and agility, which were less pronounced
when dealing with spreadsheets.
A solution to this problem is the analytic sandbox, which attempts to resolve the conflict
for analysts and data scientists with EDW and more formally managed corporate data. In
this model, the IT group may still manage the analytic sandboxes, but they will be
purposefully designed to enable robust analytics, while being centrally managed and
secured. These sandboxes, often referred to as workspaces, are designed to enable teams
to explore many datasets in a controlled fashion and are not typically used for
enterprise-level financial reporting and sales dashboards.
Many times, analytic sandboxes enable high-performance computing using in-database
processing—the analytics occur within the database itself. The idea is that performance of
the analysis will be better if the analytics are run in the database itself, rather than bringing
the data to an analytical tool that resides somewhere else. In-database analytics, discussed
further in Chapter 11, “Advanced Analytics—Technology and Tools: In-Database
Analytics,” creates relationships to multiple data sources within an organization and saves
time spent creating these data feeds on an individual basis. In-database processing for
deep analytics enables faster turnaround time for developing and executing new analytic
models, while reducing, though not eliminating, the cost associated with data stored in
local, “shadow” file systems. In addition, rather than the typical structured data in the
EDW, analytic sandboxes can house a greater variety of data, such as raw data, textual
data, and other kinds of unstructured data, without interfering with critical production
databases. Table 1.1 summarizes the characteristics of the data repositories mentioned in
this section.
Table 1.1 Types of Data Repositories, from an Analyst Perspective
Data Repository: Characteristics
Spreadsheets and data marts (“spreadmarts”):
    Spreadsheets and low-volume databases for recordkeeping
    Analyst depends on data extracts
Data Warehouses:
    Centralized data containers in a purpose-built space
    Supports BI and reporting, but restricts robust analyses
    Analyst dependent on IT and DBAs for data access and schema changes
    Analysts must spend significant time to get aggregated and disaggregated data extracts from multiple sources
Analytic Sandbox (workspaces):
    Data assets gathered from multiple sources and technologies for analysis
    Enables flexible, high-performance analysis in a nonproduction environment; can leverage in-database processing
    Reduces costs and risks associated with data replication into “shadow” file systems
    “Analyst owned” rather than “DBA owned”
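The idea of in-database processing described in this section can be illustrated with a small sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for an analytic sandbox; the table name support_calls and its rows are hypothetical. The first approach extracts every row to the analysis tool, while the second runs the aggregation inside the database so that only the result leaves it.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")  # stand-in for an analytic sandbox database
conn.execute("CREATE TABLE support_calls (problem_type TEXT, minutes REAL)")
conn.executemany(
    "INSERT INTO support_calls VALUES (?, ?)",
    [("boot", 12.0), ("boot", 18.0), ("network", 30.0),
     ("network", 10.0), ("disk", 25.0)],
)

# Approach 1: bring all the data to the tool, then aggregate there.
minutes_by_type = defaultdict(list)
for problem_type, minutes in conn.execute(
    "SELECT problem_type, minutes FROM support_calls"
):
    minutes_by_type[problem_type].append(minutes)
client_side = {k: sum(v) / len(v) for k, v in minutes_by_type.items()}

# Approach 2: in-database, only the small aggregated result is returned.
in_database = dict(
    conn.execute(
        "SELECT problem_type, AVG(minutes) FROM support_calls GROUP BY problem_type"
    )
)
conn.close()
```

Both approaches give the same averages on this toy table; the difference that matters at scale is how much data has to move out of the database before the analysis can run.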
There are several things to consider with Big Data Analytics projects to ensure the
approach fits with the desired goals. Due to the characteristics of Big Data, these projects
lend themselves to decision support for high-value, strategic decision making with high
processing complexity. The analytic techniques used in this context need to be iterative
and flexible, due to the high volume of data and its complexity. Performing rapid and
complex analysis requires high-throughput network connections and a consideration for
the acceptable amount of latency. For instance, developing a real-time product
recommender for a website imposes greater system demands than developing a near-real-time
recommender, which may still provide acceptable performance, have slightly greater
latency, and may be cheaper to deploy. These considerations require a different approach
to thinking about analytics challenges, which will be explored further in the next section.
1.2 State of the Practice in Analytics
Current business problems provide many opportunities for organizations to become more
analytical and data driven, as shown in Table 1.2.
Table 1.2 Business Drivers for Advanced Analytics
Business Driver: Examples
Optimize business operations: Sales, pricing, profitability, efficiency
Identify business risk: Customer churn, fraud, default
Predict new business opportunities: Upsell, cross-sell, best new customer prospects
Comply with laws or regulatory requirements: Anti-Money Laundering, Fair Lending, Basel II-III, Sarbanes-Oxley (SOX)
Table 1.2 outlines four categories of common business problems that organizations
contend with where they have an opportunity to leverage advanced analytics to create
competitive advantage. Rather than only performing standard reporting on these areas,
organizations can apply advanced analytical techniques to optimize processes and derive
more value from these common tasks. The first three examples do not represent new
problems. Organizations have been trying to reduce customer churn, increase sales, and
cross-sell customers for many years. What is new is the opportunity to fuse advanced
analytical techniques with Big Data to produce more impactful analyses for these
traditional problems. The last example portrays emerging regulatory requirements. Many
compliance and regulatory laws have been in existence for decades, but additional
requirements are added every year, which represent additional complexity and data
requirements for organizations. Laws related to anti-money laundering (AML) and fraud
prevention require advanced analytical techniques to comply with and manage properly.
1.2.1 BI Versus Data Science
The four business drivers shown in Table 1.2 require a variety of analytical techniques to
address them properly. Although much is written generally about analytics, it is important
to distinguish between BI and Data Science. As shown in Figure 1.8, there are several
ways to compare these groups of analytical techniques.
Figure 1.8 Comparing BI with Data Science
One way to evaluate the type of analysis being performed is to examine the time horizon
and the kind of analytical approaches being used. BI tends to provide reports, dashboards,
and queries on business questions for the current period or in the past. BI systems make it
easy to answer questions related to quarter-to-date revenue and progress toward quarterly
targets, and to understand how much of a given product was sold in a prior quarter or year.
These questions tend to be closed-ended and explain current or past behavior, typically by
aggregating historical data and grouping it in some way. BI provides hindsight and some
insight and generally answers questions related to “when” and “where” events occurred.
By comparison, Data Science tends to use disaggregated data in a more forward-looking,
exploratory way, focusing on analyzing the present and enabling informed decisions about
the future. Rather than aggregating historical data to look at how many of a given product
sold in the previous quarter, a team may employ Data Science techniques such as time
series analysis, further discussed in Chapter 8, “Advanced Analytical Theory and
Methods: Time Series Analysis,” to forecast future product sales and revenue more
accurately than extending a simple trend line. In addition, Data Science tends to be more
exploratory in nature and may use scenario optimization to deal with more open-ended
questions. This approach provides insight into current activity and foresight into future
events, while generally focusing on questions related to “how” and “why” events occur.
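The contrast between the two styles of question can be sketched in a few lines of Python. The sales figures below are invented for illustration, and the forecasting function is a deliberately simple seasonal stand-in (same quarter last year, scaled by year-over-year growth), not one of the time series methods covered in Chapter 8.

```python
from collections import defaultdict

# Hypothetical unit sales as (year, quarter, units); not data from the text.
SALES = [
    (2012, 1, 100.0), (2012, 2, 110.0), (2012, 3, 125.0), (2012, 4, 150.0),
    (2013, 1, 120.0), (2013, 2, 132.0), (2013, 3, 150.0), (2013, 4, 180.0),
]

# BI-style, closed-ended question: how much was sold in each past year?
yearly_totals = defaultdict(float)
for year, _quarter, units in SALES:
    yearly_totals[year] += units


# Data Science-style, forward-looking question: what might next quarter look like?
def seasonal_growth_forecast(sales, season_length=4):
    """Forecast the next period from the same season one year earlier,
    scaled by the most recent year-over-year growth (illustrative only)."""
    units = [u for _, _, u in sales]
    last_year = units[-season_length:]
    prior_year = units[-2 * season_length:-season_length]
    growth = sum(last_year) / sum(prior_year)
    return units[-season_length] * growth


forecast_next_quarter = seasonal_growth_forecast(SALES)
```

The aggregation only summarizes what already happened, whereas the forecast uses the disaggregated quarterly values to say something about a period that has not occurred yet.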
Where BI problems tend to require highly structured data organized in rows and columns
for accurate reporting, Data Science projects tend to use many types of data sources,
including large or unconventional datasets. Depending on an organization’s goals, it may
choose to embark on a BI project if it is doing reporting, creating dashboards, or
performing simple visualizations, or it may choose Data Science projects if it needs to do
a more sophisticated analysis with disaggregated or varied datasets.
1.2.2 Current Analytical Architecture
As described earlier, Data Science projects need workspaces that are purpose-built for
experimenting with data, with flexible and agile data architectures. Most organizations
still have data warehouses that provide excellent support for traditional reporting and
simple data analysis activities but unfortunately have a more difficult time supporting
more robust analyses. This section examines a typical analytical data architecture that may
exist within an organization.
Figure 1.9 shows a typical data architecture and several of the challenges it presents to
data scientists and others trying to do advanced analytics. This section examines the data
flow to the Data Scientist and how this individual fits into the process of getting data to
analyze on projects.
