Big Data

What is Big Data


Bernard Marr defines Big Data as the digital trace that we are generating in this digital era. This digital trace is made up of all the data that is captured when we use digital technology. The basic idea behind the phrase Big Data is that everything we do is increasingly leaving a digital trace which we can use and analyze to become smarter. The driving forces in this brave new world are access to ever-increasing volumes of data and our ever-increasing technological capability to mine that data for commercial insights.

The research firm Gartner defines Big Data as follows:

Big Data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

Ernst & Young offer the following definition:

Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools and machines. It requires new, innovative, and scalable technology to collect, host and analytically process the vast amount of data gathered in order to derive real-time business insights that relate to consumers, risk, profit, performance, productivity management and enhanced shareholder value.

Lisa Arthur, a Forbes contributor, defines Big Data as a collection of data from traditional and digital sources, inside and outside a company, that represents a source of ongoing discovery and analysis.

There is no one definition of Big Data, but there are certain elements that are common across the different definitions, such as velocity, volume, variety, and veracity.

These are the V's of Big Data.

  • Velocity is the speed of the data or the speed at which data accumulates.
  • Volume is the scale of the data or the increase in the amount of data stored.
  • Variety is the diversity of the data. We have structured data, which fits neatly into the rows and columns of relational databases, and unstructured data, which is not organized in a pre-defined way: for example, tweets, blog posts, pictures, numbers, and even video data.
  • Veracity is the conformity to facts and accuracy. With a large amount of data available, the debate rages on about the accuracy of data in the digital era. Is the information real, or is it false?

Let's unpack the V's even further.

Velocity is the idea that data is being generated extremely fast, a process that never stops. Attributes include near or real-time streaming and local and cloud-based technologies that can process information very quickly.

Volume is the amount of data generated: for example, exabytes, zettabytes, yottabytes, etc. Drivers of volume are the increase in data sources, higher-resolution sensors, and scalable infrastructure.

Veracity is the quality and origin of data. Attributes include consistency, completeness, integrity, and ambiguity. Drivers include cost and the need for traceability.

Variety is the idea that data comes from different sources: machines, people, and processes, both internal and external to organizations. Attributes include the degree of structure and complexity, and drivers are mobile technologies, social media, wearable technologies, geo-technologies, video, and many, many more.

And the last V is value. Let's look at some examples of the V's in action.

Velocity: Every 60 seconds, hours of footage are uploaded to YouTube. This amount of data is generated every minute, so think about how much accumulates over hours, days, and years.

Volume: Every day we create approximately 2.5 quintillion bytes of data. That's 10 million Blu-ray discs every day. The world population is approximately seven billion people, and the vast majority of people are now using digital devices. These devices all generate, capture, and store data. And with many people owning more than one device (for example, mobile devices, desktop computers, and laptops), we're seeing even more data being produced.

Variety: Let's think about the different types of data: text, pictures, and film. What about sound, health data from wearable devices, and the many different types of data from devices connected to the Internet of Things?

Veracity: 80% of data is considered to be unstructured, and we must devise ways to produce reliable and accurate insights. The data must be categorized, analyzed, and visualized.

The emerging V is value. This V refers to our ability and need to turn data into value.

Value isn't just profit. It may be medical or social benefits, or customer, employee, or personal satisfaction. The main reason people invest time in understanding Big Data is to derive value from it.

Big Data in Business



In this lesson, we will provide you with an overview of Big Data, and you will learn how to get value from it. We will cover the terms, the concepts, and the technologies, and what has led to the big data era.

Many of us are generating and using big data without being aware that we are.

How is big data impacting business and people? Have you ever searched for or bought a product on Amazon?

Did you notice that Amazon started making recommendations related to the product you searched for?

Recommendation engines are a common application of big data. Companies like Amazon, Netflix and Spotify use algorithms based on big data to make specific recommendations based on customer preferences and historical behavior.
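The actual algorithms these companies use are proprietary and far more sophisticated, but the core idea behind a simple item-based recommender can be sketched in a few lines. Everything below is illustrative: the purchase histories, item names, and the `recommend` helper are all invented, and the co-occurrence counting stands in for real similarity measures.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: user -> set of items bought.
histories = {
    "ana":   {"camera", "tripod", "sd_card"},
    "ben":   {"camera", "sd_card"},
    "carla": {"camera", "tripod"},
    "dave":  {"laptop", "mouse"},
}

# Count how often each pair of items appears in the same history.
co_occurrence = Counter()
for items in histories.values():
    for a, b in combinations(sorted(items), 2):
        co_occurrence[(a, b)] += 1
        co_occurrence[(b, a)] += 1

def recommend(item, top_n=2):
    """Return the items most often bought together with `item`."""
    scores = {b: n for (a, b), n in co_occurrence.items() if a == item}
    return [b for b, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:top_n]
```

A user who searched for a camera would then be shown the items other camera buyers tended to buy, such as SD cards and tripods. Real systems replace raw counts with normalized similarity scores and historical behavior models, as the lesson describes.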

Personal assistants like Siri on Apple devices use big data to devise answers to the infinite number of questions end users may ask.

Google Now makes recommendations based on the big data on a user's device. Now that we have an idea of how consumers are using big data, let's take a look at how big data is impacting business.

In 2011, McKinsey & Company said that big data was going to become the key basis of competition supporting new waves of productivity growth and innovation.

In 2013, UPS announced that it was using data from customers, drivers, and vehicles in a new route guidance system aimed at saving time, money, and fuel. Initiatives like this one support the statement that big data will fundamentally change the way businesses compete and operate.

How does a firm gain a competitive advantage?

Have you ever heard of the Netflix show called House of Cards?

The first season of the show was released in 2013, and it was an immediate hit. At the time, the New York Times reported that Netflix executives knew that House of Cards would be a hit before they even filmed it. But how did they know that?

Big data.

Netflix has a lot of data. Netflix knows the time of day when movies are watched. It logs when users pause, rewind and fast forward. It has ratings from millions of users as well as the information on searches they make.

By looking at all this big data, Netflix knew that many of its users had streamed the work of David Fincher and that films featuring Kevin Spacey had always done well. And it knew that the British version of House of Cards had also done well. It also knew that people who liked Fincher also liked Spacey.

All this information suggested that buying the series would be a good bet for the company, and in fact it was. In other words, thanks to big data, Netflix knows what people want before they do.

Now let's review another example. Market saturation and selective customers will require Chinese e-commerce companies to make better use of big data in order to gain market share.

Companies will have to persuade customers to shop more frequently, to make larger purchases and to buy from a broader array of online shopping categories.

E-commerce players already have the tools to do this as digital shopping grows. Leading players are already using data to build models aimed at boosting retention rates and spending per customer based on e-commerce data.

They have also started to adopt analytics-backed pricing and promotional activities.

The Internet of Things refers to the exponential rise of connected devices. IoT suggests that many different types of products will be connected to a network or to the internet: for example, refrigerators, coffee machines, or pillows. Another category of IoT is wearables, which refers to items of clothing or things we wear that are now connected.

These items include Fitbits, Apple Watches or the new Nike running shoes that tie their own shoelaces.

You have seen some of the characteristics of big data and you have seen some of the applications.

Beyond the Hype


In this lesson, we will look at some examples of Big Data and how it is being generated. We will discuss sources of Big Data and the different types of Big Data. So why is everyone talking about Big Data?

More data has been created in the past two years than in the entire history of humankind. By 2020, about 1.7 megabytes of new information will be created every second for every human being in the world.

By 2020, the data we create and copy will reach around 35 zettabytes, up from only 7.9 zettabytes today. The chart on the right shows the growth in global data in zettabytes. Note the jump from 2015 to 2020 of 343%.

How big is a zettabyte? One bit is binary; it's either a one or a zero. Eight bits make up one byte, and 1024 bytes make up one kilobyte. 1024 kilobytes make up one megabyte. Large videos and DVDs are measured in gigabytes, where 1024 megabytes make up one gigabyte of storage space. These days we have USBs or memory sticks that can store a few dozen gigabytes of information, while computers and hard drives now store terabytes of information.

One terabyte is 1024 gigabytes. 1024 terabytes make up one petabyte, and 1024 petabytes make up an exabyte. Think of a big urban city or a busy international airport like Heathrow, JFK, O'Hare, Dubai, or O. R. Tambo in Johannesburg.
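The ladder of units just described, where each step is 1024 times the previous one, can be captured in a short sketch. The `bytes_in` and `human_readable` helpers are illustrative names, not part of any library:

```python
# Each storage unit is 1024 times the previous one, per the hierarchy above.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]

def bytes_in(unit):
    """Number of bytes in one of the named units."""
    return 1024 ** UNITS.index(unit)

def human_readable(n_bytes):
    """Express a byte count in the largest unit it fills at least once."""
    for i in reversed(range(len(UNITS))):
        if n_bytes >= 1024 ** i:
            return f"{n_bytes / 1024 ** i:.2f} {UNITS[i]}s"
    return "0 bytes"
```

For example, `bytes_in("zettabyte")` works out to 1024 raised to the seventh power, which is why the jump from today's data volumes to tens of zettabytes is so dramatic.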

And now we're talking petabytes and exabytes. All those airplanes are capturing and transmitting data. All the people in those airports have mobile devices. Also consider the security cameras and all the staff in and around the airport.

A digital universe study conducted by IDC claimed digital information reached 0.8 zettabytes last year and predicted this number would grow to 35 zettabytes by 2020. It is predicted that by 2020, one tenth of the world's data will be produced by machines, and most of the world's data will be produced in emerging markets. It is also predicted that the amount of data produced will increasingly outpace available storage. Advances in cloud computing have contributed to the increasing potential of Big Data. According to McKinsey in 2013, the emergence of cloud computing has highly contributed to the launch of the Big Data era.

Cloud computing allows users to access highly scalable computing and storage resources through the internet. By using cloud computing, companies can use server capacity as needed and expand it rapidly to the large scale required to process big data sets and run complicated mathematical models.

Cloud computing lowers the price to analyze big data, as the resources are shared across many users, who pay only for the capacity they actually utilize. A survey by IBM and Saïd Business School identified three major sources of Big Data: people-generated data, machine-generated data, and business-generated data, which is the data that organizations generate within their own operations.

The chart on the right shows different responses, where respondents were allowed to select multiple answers. Big Data will require analysts to have Big Data skills. Big Data skills include discovering and analyzing trends that occur in Big Data.

Big Data comes in three forms: structured, unstructured, and semi-structured.

  • Structured data is data that is organized, labelled, and has a strict model that it follows.
  • Unstructured data is said to make up about 80% of data in the world. The data is usually in text form and does not have a predefined model or any formal organization.
  • And semi-structured data is a combination of the two. It is similar to structured data, where it may have an organized structure, but lacks a strictly-defined model.

Some sources of structured Big Data are relational databases and spreadsheets.

With this type of structure, we know how data is related to other data, what the data means, and the data is easy to query, using a programming language like SQL. Some sources of semi-structured Big Data are XML and JSON files.
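The difference can be made concrete in a few lines using Python's standard library. The record below is invented: it shows semi-structured JSON, where keys mark the fields but records need not share one strict schema, being flattened into structured rows with fixed columns that could then be queried like a table.

```python
import json

# A semi-structured document (hypothetical data): keys mark the fields,
# but the second record is missing the "location" field.
raw = '''
[
  {"user": "ana", "tweet": "big data!", "location": "NYC"},
  {"user": "ben", "tweet": "hello"}
]
'''

records = json.loads(raw)

# Flatten into structured rows with a fixed set of columns,
# filling any missing fields with None.
columns = ["user", "tweet", "location"]
rows = [tuple(rec.get(col) for col in columns) for rec in records]
```

Once flattened this way, the data fits the rows-and-columns model that relational databases and SQL expect.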

These sources use tags or other markers to enforce hierarchies of records and fields within the data.

A large multi-telescope radio project called the Square Kilometre Array, or SKA, produced about 1,000 petabytes of raw data a day as of 2011.

It is projected that it will produce about 20,000 petabytes or 20 billion gigabytes of data each day in 2020.

Currently, there is an explosion of data coming from internet activity and in particular, video production and consumption as well as social media activities. These numbers will just keep growing as internet speeds increase and as more and more people all over the world have access to the internet.

Structured data refers to any data that resides in a fixed field within a record or file. It has the advantage of being easily entered, stored, queried, and analyzed.

In today's business setting, most Big Data generated by organizations is structured and stored in data warehouses.

Highly structured business-generated data is considered a valuable source of information and thus equally important as machine and people-generated data.

Big Data and Data Science


In this lesson, we will look at how big data relates to data science.

We will look at the skills data scientists should have and we will look at what's involved in the data science process.

When we look at big data, we can start with a few broad topics: integration, analysis, visualization, optimization, security, and governance.

Let's start off with a quick definition of integration. To integrate means to bring together or incorporate parts into a whole.

In big data, it would be ideal to have one platform to manage all of the data, rather than individual solutions that each create separate silos of insight.

Big data has to be bigger than just one technology or one enterprise solution which was built for one purpose.

For example, a bank should be thinking about how to integrate its retail banking, its commercial banking, and investment banking.

One way to be bigger than one technology is to use Hadoop when dealing with big data.

The Hadoop Distributed File System, or HDFS, stores data across many different locations, creating a centralized place to store and process the data.

Many large companies make use of Hadoop in their technologies. Analysis.

Let's look at a Walmart example. Walmart utilizes a search engine called Polaris, which helps shoppers search for products they wish to buy. It takes into account how a user is behaving on the website in order to surface the best results for them.

Polaris will bring up items that are based on a user's interests and, because many consumers visit Walmart's website, large amounts of data are collected, making the analysis on that big data very important. Visualization.

Some people work well with tables of data; however, the vast majority of people need big data to be presented to them in a graphical way so they can understand it.

Data visualization is helpful to people who need to analyze the data, like analysts or data scientists, and it is especially useful to non-technical people who need to make decisions from data, but don't work with it on a daily basis.

An example of visualizing big data is in displaying the temperature on a map by region. By using the massive amounts of data collected by sensors and satellites in space, viewers can get a quick and easy summary of where it's going to be hot or cold.

Security and governance. Data privacy is a critical part of the big data era. Businesses and individuals must give great thought to how data is collected, retained, used, and disclosed. A privacy breach occurs when there is unauthorized access to, or collection, use, or disclosure of, personal information, and often this leads to litigation.

Companies must establish strict controls and privacy policies in compliance with the legal framework of the geographic region they are in.

Big data governance requires three things: automated integration, that is, easy access to the data wherever it resides; visual content, that is, easy categorization, indexing, and discovery within big data to optimize its usage; and agile governance, that is, the definition and execution of governance appropriate to the value of the data and its intended use. Looking at these three things provides businesses with a quick way to profile the level of importance of the data and, therefore, the amount of security required to protect it.

Some of the applications used in big data are Hadoop, Oozie, Flume, Hive, HBase, Apache Pig, Apache Spark, MapReduce and YARN, Sqoop, ZooKeeper, and text analytics. We need people with the skills to run these applications and analyze big data.

Big Data University offers free courses on Hadoop, machine learning, analytics, Spark, and much more. Look up Big Data Dudes to learn more about Spark and big data. There are many MOOCs, or massive open online courses, and some formal programs in big data, too.

Data science is the process of cleaning, mining, and analyzing data to derive insights of value from it. In data science, the size of the data is less important; one can use data of all sizes, small, medium, and big, that is related to a business or scientific case. Insights are extracted through a combination of exploratory data analysis and modeling. Data science is the process of distilling insights from data to inform decisions. A data scientist is a person who is qualified to derive insights from data by using skills and experience from computer science, business or science, and statistics.

These are some of the skills that a data scientist must have. One can use the following process to make sense of big data.

  • Determine problem.
    • What is the business problem?
    • What is the project objective?
    • What would you do if you had all the data?
  • Collect data.
    • Which data is relevant?
    • Are there any privacy issues?
  • Explore the data.
    • Plot the data.
    • Are there any patterns?
  • Analyze the data.
    • Build a model.
    • Fit the model.
    • Validate the model.
  • Storytelling. Visualization plus communication.
    • Can we tell a story?
  • Take action and make decisions.

You have seen some applications of big data and you have learned the data science process.

Big Data Use Cases


In this lesson, we will look at some use cases for big data, and we will see how big data is adding value in business.

We'll be looking at big data exploration to find, visualize, and understand big data to improve business knowledge. We will learn the concept of the enhanced 360-degree view: a way of looking at the customer to achieve a true unified view, incorporating internal and external data sources.

We will explore the concept of security and intelligence extension, to lower risk, detect fraud, and monitor cyber security in real-time.

We will look at operations analysis to analyze a variety of machine data to improve business results.

Big data exploration addresses a challenge faced by every large organization: business information is spread across multiple systems and silos. Big data exploration enables you to explore and mine big data to find, visualize, and understand all your data, to improve decision making.

By creating a unified view of information across all data sources, both inside and outside of your organization, you gain enhanced value and new insights.

Let's look at a transportation example. By using data from different systems, such as cameras at different points in a city, weather information, and GPS data from Uber, taxis, trucks, and cars, we can predict traffic at a faster and more accurate pace to deploy real-time, smarter traffic systems that improve traffic flow. There are many positive benefits from this, including reduced fuel emissions, better public transportation planning, and longer-lasting transportation infrastructure. With the advent of self-driving cars, machine learning algorithms can be trained using historical and real-time data from human-driven cars on the road; this would teach the driverless car how real drivers behave in different traffic situations, weather conditions, and circumstances.

In the digital era, the touch points between an organization and its customers have increased many times over, and organizations now require specialized solutions to effectively manage these connections.

An enhanced 360-degree view of the customer is a holistic approach that takes into account all available and meaningful information about the customer to drive better engagement, revenue, and long-term loyalty.

This is the basis for modern customer relationship management, or CRM, systems. Let's look at an example in detail. By taking an enhanced 360-degree view of the customer, and making use of available and meaningful information such as spending habits, shopping behavior, and preferences, grocery stores are able to plan, prepare, and provide better services to customers.

The growing number of high-tech crimes, cyber-based terrorism, espionage, computer intrusions, and major cyber fraud cases poses a real threat to every individual and organization.

To meet these security challenges, businesses are using big data technologies to change and enhance their cybersecurity and intelligence activities. How? By processing and analyzing new data types, such as social media and emails, and by analyzing hours and hours of video footage.

Analyzing data in motion and at rest can help find new associations, or uncover patterns and facts, to significantly improve intelligence, security, and law enforcement.

Operations analysis focuses on analyzing machine data, which can include anything from signals, sensors, and logs to data from GPS devices.

This type of data is growing at an exponential rate, and it comes in large volume, and a variety of formats. Using big data for operations analysis, organizations can gain real-time visibility into operations, customer experience, transactions, and behavior.

Big data empowers businesses to predict when a machine will stop working, when machine components need to be replaced, and even when employees will resign.

Let's look at an example. Airplane engines generate large amounts of data every second.

By analyzing this massive amount of data from the turbine, and even other sensors on the plane such as GPS, temperature, and speed, organizations are able to gain real-time visibility into the operations of the plane.

This data is used to run the aircraft safely and efficiently, and in the unlikely event of a crash, this data can also tell air crash investigators exactly what caused the accident.
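A toy version of this kind of operations analysis can be sketched with a simple statistical rule. The readings and the two-standard-deviation threshold below are invented for illustration; real aviation monitoring uses far richer models.

```python
import statistics

# Simulated engine-temperature readings (hypothetical sensor data).
readings = [612, 615, 610, 618, 613, 702, 611, 616]

mean = statistics.mean(readings)
stdev = statistics.pstdev(readings)

# Flag readings more than 2 standard deviations from the mean; in a
# real system such outliers would trigger a maintenance alert.
anomalies = [r for r in readings if abs(r - mean) > 2 * stdev]
```

Here the single spike stands out against the otherwise stable readings, which is exactly the kind of signal that lets operators schedule maintenance before a component fails.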

Many present-day aviation regulations and protocols have come from the data collected in past incidents.

Personalized recommendations. Walmart consistently sends tailored product offers based on customer behavior, online and in stores.

Walmart has also enjoyed success in email marketing campaigns by optimizing the time that offers are sent. Walmart tracks each campaign's open rate and realigns delivery times based on individual user patterns.

You have seen some use cases for big data, and how big data is adding value to a business.

Processing Big Data


In this lesson, you will learn about Big Data processing technologies.

You will learn about Hadoop, what it is and why it is considered a great Big Data solution.

In a report by the McKinsey Global Institute from 2011, the main components and ecosystem are outlined as follows:

  • Techniques for Analyzing Data, such as A/B Testing, Machine Learning, and Natural Language Processing.
  • Big Data Technologies like Business Intelligence, Cloud Computing, and Databases.
  • Visualization such as Charts, Graphs, and Other Displays of the data.

The Big Data processing technologies we will discuss work to bring large sets of structured and unstructured data into a format where analysis and visualization can be conducted. Value can only be derived from Big Data if it can be reduced or repackaged into formats that can be understood by people.

One trend making the Big Data revolution possible is the development of new software tools and database systems such as Hadoop, HBase, and NoSQL for large, unstructured data sets. There are a number of vendors that offer Big Data processing tools and Big Data education.

We'll start with IBM, who host Big Data University and Data Scientist Workbench. Data Scientist Workbench is a cloud hosted collection of open source tools such as OpenRefine, Jupyter Notebooks, Zeppelin Notebooks, and RStudio. It provides easy access to Spark, Hadoop, and a variety of other Big Data analytic engines, in addition to programming languages such as Python, R, and Scala.

So what is the Hadoop framework? Hadoop is an open-source software framework used to store and process huge amounts of data. It is implemented in several distinct, specialized modules:

  • Storage, principally employing the Hadoop Distributed File System, or HDFS.
  • Resource management and scheduling for computational tasks.
  • Distributed processing programming models based on MapReduce.
  • Common utilities and software libraries necessary for the entire Hadoop platform.

Hadoop is a framework written in Java, originally developed by Doug Cutting who named it after his son's toy elephant. Hadoop uses Google's MapReduce technology as its foundation. Let's review some of the terminology used in any Hadoop discussion.

A node is simply a computer. This is typically non-enterprise, commodity hardware that contains data. So in this example, we have node one, then we can add more nodes such as node two, node three, and so on. This would be called a rack. A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. Network bandwidth between any two nodes in a rack is greater than bandwidth between two nodes on different racks.

The Hadoop Cluster is a collection of racks. IBM Analytics defines Hadoop as follows: Apache Hadoop is a highly scalable storage platform designed to process very large data sets across hundreds to thousands of computing nodes that operate in parallel. It provides a cost-effective storage solution for large data volumes with no format requirements. MapReduce, the programming paradigm that allows for this massive scalability, is the heart of Hadoop.
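Real MapReduce jobs are typically written against the Hadoop Java API and run across a cluster, but the map-shuffle-reduce pattern itself can be sketched in plain Python on the classic word-count example. This is a conceptual sketch only, not Hadoop code; the documents are invented.

```python
from collections import defaultdict

documents = ["big data is big", "data is everywhere"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key (word). Hadoop performs this step
# between the map and reduce phases, across the nodes of the cluster.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
```

The scalability comes from the fact that the map and reduce steps are independent per document and per word, so Hadoop can spread them across the nodes and racks described above.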

Why Hadoop? According to IBM Analytics, some companies are delaying data opportunities because of organizational constraints. Others are not sure what distribution to choose, and still others simply can't find time to mature their Big Data delivery due to the pressure of day-to-day business needs. The smartest Hadoop strategies start with choosing recommended distributions, then maturing the environment with modernized hybrid architectures, and adopting a data lake strategy based on Hadoop technology. Data lakes are a method of storing data that keeps vast amounts of raw data in its native format, scaling horizontally to support the analysis of originally disparate sources of data. Big Data is best thought of as a platform rather than a specific set of software.

Data Warehouses are part of a Big Data platform. They deliver deep insight with advanced in-database analytics and operational analytics.

Data Warehouses provide online analytic processing, or OLAP. Data Warehouse Modernization, formerly known as Data Warehouse Augmentation, is about building on an existing Data Warehouse infrastructure, leveraging Big Data technologies to augment its capabilities, essentially an upgrade. There are three key types of Data Warehouse Modernization:

  • Pre-processing: using Big Data as a landing zone before determining what data should be moved to the Data Warehouse. It could be categorized as irrelevant data, or relevant data, which would go to the Data Warehouse.
  • Offloading: moving infrequently accessed data from Data Warehouses into enterprise-grade Hadoop.
  • Exploration: using big data capabilities to explore and discover new high-value data from massive amounts of raw data, and free up the Data Warehouse for more structured deep analytics.

You have learned what Hadoop is and why it is a great solution for Big Data.

Course Summary


This has been an Introduction to Big Data. This course was designed to introduce Big Data ideas and terminology.

We covered these learning objectives:

  • Definitions of Big Data,
  • Characteristics of Big Data,
  • A strategy for Big Data,
  • Big Data use cases,
  • and Big Data systems.

We looked at a definition of Big Data by Bernard Marr. Big Data is the digital trace that we are generating in this digital era. This digital trace is made up of all the data that is captured when using digital technology.

The driving forces in this brave new world are access to ever-increasing volumes of data, and our ever-increasing technological ability to mine that data for commercial insights.

We looked at the characteristics of Big Data, such as volume, velocity, veracity, variety, and value. And we looked at the sources of the data. We discussed the differences between Structured and Unstructured Data.

Structured Data is data that is organized, labeled, and follows a strict model. Examples are spreadsheets or relational databases.

Unstructured Data makes up about 80% of data. Examples are text or video. Unstructured Data does not have a predefined model, and it is not organized in a formal way.

Semi-Structured Data falls between these two data types. We explored a number of Big Data use cases, and we discussed the skills required for Data Science practitioners. We defined Data Science as the process of distilling insights from data to inform decisions. And we looked at the impact of Big Data in business.

We also looked at Big Data systems, and we focused on Hadoop as a Big Data solution. Hadoop is an open-source software framework used to store and process huge amounts of data. It is implemented in several distinct, specialized modules.

1. Storage, principally employing the Hadoop Distributed File System, or HDFS.

2. Resource management and scheduling for computational tasks.

3. Distributed processing programming model based on MapReduce.

4. Common utilities and software libraries necessary for the entire Hadoop platform.

You have completed the Introduction to Big Data. Big Data University has many more free courses, and you can learn more about any of the different topics covered in this course.


Professor Norman White (@normwhite) is the Faculty Director at the Stern Center for Research Computing at New York University. The following readings are from his blog, researchcomputing.blogspot.ca, where he comments on prevailing Big Data related events. The posts are much like diary entries and reflect what was happening at different points from 2011 to 2015.

  1. http://researchcomputing.blogspot.com/2011/10/big-data-and-business-analytics-comes.html
  2. http://researchcomputing.blogspot.com/2011/04/facebook-joins-google-in-hpc-computing.html
  3. http://researchcomputing.blogspot.com/2012/12/climate-change-and-big-data.html
  4. http://researchcomputing.blogspot.com/2013/01/big-data-and-sensors.html
  5. http://researchcomputing.blogspot.com/2011/06/hadoop-and-lustre-some-thoughts.html