What is Big Data

...

You have seen some applications of big data and you have learned the data science process. Professor Norman White (@normwhite) is the Faculty Director at the Stern Center for Research Computing at New York University. The following readings are from his blog, researchcomputing.blogspot.ca, where he comments on prevailing Big Data-related events. The posts read much like diary entries and reflect what was happening at different points from 2011 to 2015.

Big Data Use Cases

...

Video: https://www.youtube.com/watch?v=gXwhq5iuoG0

In this lesson, we will look at some use cases for big data, and we will see how big data is adding value in business.

We'll be looking at big data exploration, which aims to find, visualize, and understand big data to improve business knowledge. We will learn the concept of the enhanced 360-degree view, a way of looking at the customer that achieves a true unified view by incorporating internal and external data sources.

We will explore the concept of security and intelligence extension, which lowers risk, detects fraud, and monitors cybersecurity in real time.

We will look at operations analysis, which analyzes a variety of machine data to improve business results.

Big data exploration addresses a challenge faced by every large organization: business information is spread across multiple systems and silos. Big data exploration enables you to explore and mine big data to find, visualize, and understand all your data, improving decision making.

By creating a unified view of information across all data sources, both inside and outside of your organization, you gain enhanced value and new insights.

Let's look at a transportation example. By using data from different systems, such as cameras at different points in a city, weather information, and GPS data from Uber, taxis, trucks, and cars, we can predict traffic faster and more accurately, and deploy real-time, smarter traffic systems that improve traffic flow. There are many benefits, including reduced fuel emissions, better public transportation planning, and longer-lasting transportation infrastructure. With the advent of self-driving cars, machine learning algorithms can be trained using historical and real-time data from human-driven cars on the road; this teaches the driverless car how real drivers behave in different traffic situations, weather conditions, and circumstances.

In the digital era, the touch points between an organization and its customers have multiplied many times over, and organizations now require specialized solutions to manage these connections effectively.

An enhanced 360-degree view of the customer is a holistic approach that takes into account all available and meaningful information about the customer to drive better engagement, revenue, and long-term loyalty.

This is the basis for modern customer relationship management, or CRM, systems. Let's look at an example in detail. By taking an enhanced 360-degree view of the customer, drawing on available and meaningful information such as spending habits, shopping behavior, and preferences, grocery stores are able to plan, prepare, and provide better services to customers.

The growing number of high-tech crimes, cyber-based terrorism, espionage, computer intrusions, and major cyber fraud cases poses a real threat to every individual and organization.

To meet these security challenges, businesses are using big data technologies to change and enhance their cybersecurity and intelligence activities. How? By processing and analyzing new data types, such as social media and emails, and by analyzing hours and hours of video footage.

Analyzing data in motion, and at rest, can help find new associations, or uncover patterns and facts to significantly improve intelligence, security, and law enforcement. Operations analysis focuses on analyzing machine data, which can include anything from signals, sensors, and logs, to data from GPS devices.

This type of data is growing at an exponential rate, and it comes in large volumes and a variety of formats. Using big data for operations analysis, organizations can gain real-time visibility into operations, customer experience, transactions, and behavior.

Big data empowers businesses to predict when a machine will stop working, when machine components need to be replaced, and even when employees will resign.

Let's look at an example. Airplane engines generate large amounts of data every second.

By analyzing this massive amount of data from the turbine, and even other sensors on the plane such as GPS, temperature, and speed, organizations are able to gain real-time visibility into the operations of the plane.

This data is used to run the aircraft safely and efficiently, and in the unlikely event of a crash, this data can also tell air crash investigators exactly what caused the accident.

Many present-day aviation regulations and protocols have come from the data collected in past incidents.

Another use case is personalized recommendations. Walmart consistently sends tailored product offers based on customer behavior, online and in stores.

Walmart has also enjoyed success in email marketing campaigns by optimizing the time that offers are sent. Walmart tracks each campaign's open rate and realigns delivery times based on individual user patterns.

You have seen some use cases for big data, and how big data is adding value to a business.

Processing Big Data

...

Video: https://www.youtube.com/watch?v=N-oBYZTG9Ks

In this lesson, you will learn about Big Data processing technologies.

You will learn about Hadoop, what it is and why it is considered a great Big Data solution.

In a report by the McKinsey Global Institute from 2011, the main components and ecosystems are outlined as follows:

  • Techniques for Analyzing Data, such as A/B Testing, Machine Learning, and Natural Language Processing.
  • Big Data Technologies like Business Intelligence, Cloud Computing, and Databases.
  • Visualization such as Charts, Graphs, and Other Displays of the data.

The Big Data processing technologies we will discuss work to bring large sets of structured and unstructured data into a format where analysis and visualization can be conducted. Value can only be derived from Big Data if it can be reduced or repackaged into formats that can be understood by people.

One trend making the Big Data revolution possible is the development of new software tools and database systems such as Hadoop, HBase, and NoSQL for large, unstructured data sets. There are a number of vendors that offer Big Data processing tools and Big Data education.

We'll start with IBM, who hosts Big Data University and Data Scientist Workbench. Data Scientist Workbench is a cloud-hosted collection of open source tools such as OpenRefine, Jupyter Notebooks, Zeppelin Notebooks, and RStudio. It provides easy access to Spark, Hadoop, and a variety of other Big Data analytic engines, in addition to programming languages such as Python, R, and Scala.

So what is the Hadoop framework? Hadoop is an open-source software framework used to store and process huge amounts of data. It is implemented in several distinct, specialized modules:

1. Storage, principally employing the Hadoop Distributed File System, or HDFS.

2. Resource management and scheduling for computational tasks.

3. Distributed processing programming models based on MapReduce.

4. Common utilities and software libraries necessary for the entire Hadoop platform.

Hadoop is a framework written in Java, originally developed by Doug Cutting who named it after his son's toy elephant. Hadoop uses Google's MapReduce technology as its foundation. Let's review some of the terminology used in any Hadoop discussion.

A node is simply a computer. This is typically non-enterprise, commodity hardware that contains data. So in this example, we have node one, then we can add more nodes such as node two, node three, and so on. This would be called a rack. A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. Network bandwidth between any two nodes in a rack is greater than bandwidth between two nodes on different racks.

The Hadoop Cluster is a collection of racks. IBM Analytics defines Hadoop as follows: Apache Hadoop is a highly scalable storage platform designed to process very large data sets across hundreds to thousands of computing nodes that operate in parallel. It provides a cost-effective storage solution for large data volumes with no format requirements. MapReduce, the programming paradigm that allows for this massive scalability, is the heart of Hadoop.
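The map-shuffle-reduce flow at the heart of that paradigm can be sketched in a few lines of plain Python. This is an illustration of the model only, not Hadoop's actual API; the word-count example and all function names here are our own.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between
    the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# In a real cluster, mappers run in parallel on the nodes holding each
# HDFS block; here we simply iterate over the lines of a tiny "file".
lines = ["big data big insights", "big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 3, 'data': 2, 'insights': 1}
```

The scalability comes from the fact that every mapper, and every reducer, can run independently on a different node; the framework only has to move the grouped key-value pairs between them.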

Why Hadoop? According to IBM Analytics, some companies are delaying data opportunities because of organizational constraints; others are not sure what distribution to choose; and still others simply can't find the time to mature their Big Data delivery due to the pressure of day-to-day business needs. The smartest Hadoop strategies start with choosing recommended distributions, then maturing the environment with modernized hybrid architectures and adopting a data lake strategy based on Hadoop technology.

Data lakes are a method of storing data that keeps vast amounts of raw data in its native format and scales more horizontally, supporting the analysis of originally disparate sources of data. Big Data is best thought of as a platform rather than a specific set of software.

Data Warehouses are part of a Big Data platform. They deliver deep insight with advanced in-database analytics and operational analytics.

Data Warehouses provide online analytic processing, or OLAP. Data Warehouse Modernization, formerly known as Data Warehouse Augmentation, is about building on an existing Data Warehouse infrastructure, leveraging Big Data technologies to augment its capabilities: essentially, an upgrade. Given a set of data, there are three key types of Data Warehouse Modernization:

1. Pre-Processing: using Big Data as a landing zone before determining what data should be moved to the Data Warehouse. Data can be categorized as Irrelevant Data, which stays behind, or Relevant Data, which goes on to the Data Warehouse.

2. Offloading: moving infrequently accessed data from Data Warehouses into enterprise-grade Hadoop.

3. Exploration: using big data capabilities to explore and discover new high-value data from massive amounts of raw data, freeing up the Data Warehouse for more structured, deep analytics.
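The Pre-Processing pattern above can be sketched as a simple filter over a landing zone. This is a minimal illustration, not a Hadoop workflow; the clickstream records and the "purchase" relevance rule are invented for the example.

```python
def preprocess(records, is_relevant):
    """Split raw landing-zone records into a warehouse-bound set and a
    set that stays in the data lake.

    In the pre-processing pattern, everything lands in the big data
    platform first; only records judged relevant move on to the
    Data Warehouse.
    """
    warehouse, data_lake = [], []
    for record in records:
        (warehouse if is_relevant(record) else data_lake).append(record)
    return warehouse, data_lake

# Hypothetical clickstream records; the relevance rule is just an example.
raw = [
    {"event": "page_view", "user": 1},
    {"event": "purchase", "user": 1, "amount": 42.0},
    {"event": "page_view", "user": 2},
]
to_warehouse, stays_in_lake = preprocess(raw, lambda r: r["event"] == "purchase")
print(len(to_warehouse), len(stays_in_lake))  # 1 2
```

The point of the pattern is that the relevance decision is deferred until after the data has landed, so nothing is thrown away before it has been looked at.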

You have learned what Hadoop is and why it is a great solution for Big Data.

Course Summary

...

Video: https://www.youtube.com/watch?v=WvQv6Ps4EQY

This has been an Introduction to Big Data. This course was designed to introduce Big Data ideas and terminology.

We covered these learning objectives:

  • Definitions of Big Data,
  • Characteristics of Big Data,
  • A strategy for Big Data,
  • Big Data use cases,
  • and Big Data systems.

We looked at a definition of Big Data by Bernard Marr. Big Data is the digital trace that we are generating in this digital era. This digital trace is made up of all the data that is captured when using digital technology.

The driving forces in this brave new world are access to ever-increasing volumes of data, and our ever-increasing technological ability to mine that data for commercial insights.

We looked at the characteristics of Big Data, such as volume, velocity, veracity, variety, and value. And we looked at the sources of the data. We discussed the differences between Structured and Unstructured Data.

Structured Data is data that is organized, labeled, and follows a strict model. Examples are spreadsheets or relational databases.

Unstructured Data makes up about 80% of data. Examples are text or video. Unstructured Data does not have a predefined model, and it is not organized in a formal way.

Semi-Structured Data falls between these two data types. We explored a number of Big Data use cases, and we discussed the skills required of Data Science practitioners. We defined Data Science as the process of distilling insights from data to inform decisions. And we looked at the impact of Big Data in business.
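The structured / semi-structured / unstructured distinction can be made concrete with a short sketch (the sample records here are invented for illustration):

```python
import json

# Structured: every row follows the same strict model (fixed columns),
# as in a spreadsheet or relational database table.
structured_rows = [
    ("Alice", 34, "NYC"),
    ("Bob", 28, "Toronto"),
]

# Semi-structured: self-describing tags give it some organization,
# but fields can vary from record to record (e.g. JSON).
semi_structured = json.loads(
    '{"name": "Alice", "purchases": [{"item": "book"}]}'
)

# Unstructured: no predefined model at all; meaning has to be
# extracted, e.g. from free text, images, or video.
unstructured = "Alice wrote that she loved the book she bought last week."

print(semi_structured["purchases"][0]["item"])  # book
```

Notice that the structured rows can be queried by position or column name directly, the JSON has to be navigated by its own tags, and the free text would need natural language processing before any query is possible.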

We also looked at Big Data systems, and we focused on Hadoop as a Big Data solution. Hadoop is an open-source software framework used to store and process huge amounts of data. It is implemented in several distinct, specialized modules.

1. Storage, principally employing the Hadoop Distributed File System, or HDFS.

2. Resource management and scheduling for computational tasks.

3. Distributed processing programming model based on MapReduce.

4. Common utilities and software libraries necessary for the entire Hadoop platform.

You have completed the Introduction to Big Data. Big Data University has many more free courses, and you can learn more about any of the different topics covered in this course.

...

Professor Norman White (@normwhite) is the Faculty Director at the Stern Center for Research Computing at New York University. The following readings are from his blog, researchcomputing.blogspot.ca, where he comments on prevailing Big Data-related events. The posts read much like diary entries and reflect what was happening at different points from 2011 to 2015.

  1. http://researchcomputing.blogspot.com/2011/10/big-data-and-business-analytics-comes.html
  2. http://researchcomputing.blogspot.com/2011/04/facebook-joins-google-in-hpc-computing.html
  3. http://researchcomputing.blogspot.com/2012/12/climate-change-and-big-data.html