Data Science

Chapter 1. The Bazaar of Storytellers

Don’t thank God it’s Friday, especially if you happen to be in Pakistan or in the troubled parts of the Middle East. For Muslims, Friday is supposed to be a day of rest and atonement. In Pakistan, though, it is anything but. Friday is the preferred day for suicide bombers to attack others in markets, mosques, and streets. Statistically speaking, the odds of one dying in a suicide bomb blast are much higher on Friday than on any other day.

As I write these words on September 18, 2015, a Friday, Taliban militants have stormed a Pakistan Air Force base in Badaber, a suburb of Peshawar, which is a fabled town in the northwest of Pakistan. The attack left 20 civilians and officers dead. Also killed were the 13 militants.1

The Badaber Air Force Base used to be the epicenter of the CIA's war against the Soviet Union in Afghanistan. You can catch a glimpse of the staged Badaber airbase in Steven Spielberg's spy thriller, Bridge of Spies. You may also recall Tom Hanks playing Congressman Charlie Wilson in the 2007 movie Charlie Wilson's War, which told the story of the CIA's covert (it was an open secret) war against the Soviet Army in Afghanistan.2 By 1988, the CIA and other American agencies, such as USAID, had left the region after the Soviets retreated from Afghanistan. Nevertheless, the war in Afghanistan continued and transformed into a civil war, which now threatens the state and society in both Afghanistan and Pakistan. Curious minds interested in knowing why Islamic militancy has taken such hold in South and West Asia might want to watch the last five minutes of Charlie Wilson's War.

The September 18 attack in Badaber by the Taliban militants is similar to previous attacks on civil and military establishments. Oftentimes such attacks take place on a Friday, which, as you know by now, is the day of communal prayers in Islam. For militants, Friday is preferred for two reasons. First, mosques on Friday are filled to capacity for the early afternoon prayers; the militants want the biggest bang for their "bang," and Friday afternoon prayers look like the ideal time. Second, despite the elevated risks on Friday, security is routinely lax because even police personnel are busy kneeling with the other believers!

I first discovered the elevated risk for Fridays in 2011 while analyzing security data for Pakistan. I obtained data on terrorism in Pakistan from the South Asia Terrorism Portal.3 The 2010 terrorism incidents in Pakistan revealed a clear trend: the odds of suicide bombings were significantly higher on Friday than on any other weekday. "In 2010 alone, 43 percent of the 1,547 victims of bomb blasts were killed on a Friday. In Balochistan and Punjab, Fridays accounted for almost 60 percent of all bomb blast-related deaths."4

The targeted Friday bombings have returned in 2015. Almost 40% of the fatalities from suicide bombings occurred on a Friday (see Figure 1.1). Thanks to data science and analytics, we are able to identify elevated risks and find order in what otherwise appears to be chaos. I relied on very simple tools to expose the trends in the militancy that plagues Pakistan today. The process involved importing unstructured data from a website and turning it into a database. After the data were transformed into a structured format, I extracted the date of each attack and tabulated the number of dead and injured. I cleaned the data set and applied pivot tables to obtain a breakdown of incidents by year, month, and day of the week. In the end, I generated a graphic to highlight the key finding: Don't thank God it's Friday.


Image

Figure 1.1 Weekday distribution of bomb blast victims in Pakistan
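
For readers curious about what this workflow looks like in code, the following is a minimal sketch in R, the language used most often in this book. The file name (satp_pakistan.csv) and the column names (date and killed) are assumptions made for illustration; they do not reflect the actual layout of the South Asia Terrorism Portal data, and a spreadsheet pivot table would accomplish the same tabulation.

# A minimal sketch, assuming one row per attack with an ISO-formatted
# attack date and a count of those killed; names are hypothetical
attacks <- read.csv("satp_pakistan.csv", stringsAsFactors = FALSE)

# Parse the date and extract the day of the week for each attack
attacks$date    <- as.Date(attacks$date, format = "%Y-%m-%d")
attacks$weekday <- weekdays(attacks$date)

# Tabulate fatalities by weekday (the pivot-table step)
deaths_by_day <- tapply(attacks$killed, attacks$weekday, sum, na.rm = TRUE)

# Order the days Monday through Sunday and compute each day's share
day_order     <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                   "Friday", "Saturday", "Sunday")
deaths_by_day <- deaths_by_day[day_order]
share_by_day  <- round(100 * deaths_by_day / sum(deaths_by_day, na.rm = TRUE), 1)

# A simple bar chart highlights the Friday spike
barplot(share_by_day, las = 2, ylab = "Share of fatalities (%)",
        main = "Bomb blast fatalities by weekday")

Under these assumptions, the resulting bar chart would show the same Friday spike depicted in Figure 1.1.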

This book extols the virtues of narratives supported by data-driven analytics. In a world awash with data, sophisticated algorithms, and inexpensive storage and computing capabilities, analytics has emerged as a defining characteristic of smart firms and governments. The notion of “competing on analytics” is now well understood and appreciated.5

Data science will not be the exclusive dominion of computer scientists/engineers and statisticians. Millions trained or schooled in other disciplines will embrace data science and analytics to contribute to evidence-based planning in their respective fields. As millions will make the transition to data science, they will have to start somewhere. This book intends to be their first step on the path to becoming an established data scientist.

I believe data science and analytics are as much about storytelling as they are about algorithms. The ability to build a strong narrative from your empirical findings and then communicate it to stakeholders will distinguish accomplished data scientists from those who would prefer to remain confined to coding and number crunching. This book makes a concerted effort to establish data science as an emerging field where data, analytics, and narrative blend to give life to stories that will help devise winning strategies in business, government, and not-for-profit sectors.

Even when I was a young child, I had an inkling that storytelling would play some part in my life. It was hard to miss those clues. For starters, my ancestral home in Peshawar was located in the Kissa Khawani Bazaar, which literally means the Bazaar of Storytellers. A labyrinth of small streets—some so narrow that if you were to stretch your arms wide, you could touch walls on either side—culminated in the grand bazaar.

Even before my ancestors arrived in Peshawar, there were no storytellers left in the Kissa Khawani Bazaar, which had morphed into an intense place of commerce with hundreds of matchbox-sized shops that prompted Peshawar’s former British Commissioner, Herbert Edwardes, to proclaim it the “Piccadilly of Central Asia.”

Although it’s no Bazaar of Storytellers, this book is full of stories based on the findings from simple and advanced analytics. This chapter does the following:

• Establishes the point that data science and data scientists will be in greater demand in the future than they already are. The demand for data scientists is unlikely to be met by the number of graduates being produced by the institutes of higher learning. Thus, switching to data science will be a prudent career move for those looking for new challenges at their work.

• Introduces readers to the distinguishing theme of this book: storytelling with analytics. I offer further details about the book’s structure and my approach to teaching data science.

• Addresses the controversies about data science and answers the question: who is, or is not, a data scientist? I also present a brief, yet critical, review of big data. This book is not infatuated with data sizes; instead, it focuses on how data is deployed to develop strategies and convincing narratives. Examples highlight how big data might have been mistakenly considered the panacea for all problems and limitations.

• Provides answers to frequently asked questions about data science. The new and potential converts to data science have questions about what to learn, where to go, and what career options are available to them. I provide answers to these questions so that the readers might find their own path to data science.


DATA SCIENCE: THE SEXIEST JOB IN THE 21ST CENTURY

In a data-driven world, data scientists have emerged as a hot commodity. The chase is on to find the best talent in data science. Already, experts estimate that millions of jobs in data science might remain vacant for the lack of readily available talent. The global search for skilled data scientists is not merely a search for statisticians or computer scientists. In fact, firms are searching for well-rounded individuals who possess subject matter expertise, some experience in software programming and analytics, and exceptional communication skills.

Our digital footprint has expanded rapidly over the past 10 years. The size of the digital universe was roughly 130 billion gigabytes in 2005. By 2020, this number is expected to swell to 40 trillion gigabytes.6 Companies will compete for hundreds of thousands, if not millions, of new workers needed to navigate the digital world. No wonder the prestigious Harvard Business Review called the data scientist's role "the sexiest job in the 21st century."7

A report by the McKinsey Global Institute warns of huge talent shortages for data and analytics. “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”8

Because the digital revolution has touched every aspect of our lives, the opportunity to learn about, and benefit from, our behaviors is greater now than ever before. Given the right data, marketers can take sneak peeks into our habit formation. Research in neurology and psychology is revealing how habits and preferences are formed, and retailers like Target are out to profit from it. However, retailers can only do so if they have data scientists working for them. For this reason it is "like an arms race to hire statisticians nowadays," said Andreas Weigend, the former chief scientist at Amazon.com.9

There is still a need to convince C-suite executives of the benefits of data and analytics. It appears that senior management might be a step or two behind middle management in appreciating the potential of analytics-driven planning. Professor Peter Fader, who manages the Customer Analytics Initiative at Wharton, knows that executives can reach the C-suite without having to interact with data. He believes that the real change will happen when executives are well versed in data and analytics.10

SAP, a leader in data and analytics, reported from a survey that 92% of the responding firms in its sample experienced a significant increase in their data holdings. At the same time, three-quarters identified the need for new data science skills in their firms. Accenture believes that the demand for data scientists may outstrip supply by 250,000 in 2015 alone. A similar survey of 150 executives by KPMG in 2014 found that 85% of the respondents did not know how to analyze data. “Most organizations are unable to connect the dots because they do not fully understand how data and analytics can transform their business,” Alwin Magimay, head of digital and analytics for KPMG UK, said in an interview in May 2015.11

Bernard Marr, writing for Forbes, also raises concerns about the insufficient analytics talent. "There just aren't enough people with the required skills to analyze and interpret this information—transforming it from raw numerical (or other) data into actionable insights—the ultimate aim of any Big Data-driven initiative," he wrote.12 Marr cites a Gartner survey of business leaders, more than 50% of whom reported a lack of in-house expertise in data science.

Marr also reported on Walmart, which turned to crowdsourcing for its analytics needs. Walmart approached Kaggle to host a competition for analyzing its proprietary data. The retailer provided sales data from a shortlist of stores and asked the competitors to develop better forecasts of sales based on promotion schemes.

Given the shortage of data scientists, employers are willing to pay top dollar for the talent. Michael Chui, a principal at McKinsey, knows this too well. Data science "has become relevant to every company ... There's a war for this type of talent," he said in an interview.13 Take Paul Minton, for example. He was making $20,000 serving tables at a restaurant. He had majored in math at college. Mr. Minton took a three-month programming course that changed everything. He made over $100,000 in 2014 as a data scientist for a web startup in San Francisco. "Six figures, right off the bat ... To me, it was astonishing," said Mr. Minton.14

Could Mr. Minton be exceptionally fortunate, or are such high salaries the norm? Luck had little to do with it; the New York Times reported $100,000 as the average base salary of a software engineer and $112,000 for data scientists.

Given that the huge demand for data scientists is unlikely to be met by universities and colleges, alternatives are popping up all over the place. In the United States, one such private enterprise is Galvanize. Its data science course runs for 12 weeks and Galvanize claims the median salary of its graduates to be $115,000.15 Numerous other initiatives, including MOOCs, are adding data science courses in a hurry.

Realizing the demand sooner than other universities, North Carolina State University launched a Master's in Analytics degree in 2007. Michael Rappa, director of the Institute for Advanced Analytics, informed the New York Times that each one of the 84 graduates of the class of 2012 received a job offer. Those without experience earned, on average, $89,000, and experienced graduates netted more than $100,000.16

In 2014, I taught an MBA course in data science and analytics. It was an elective course with a small class of approximately 12 students. One student in particular, whose undergraduate degree was in nursing, was enthralled by the subject and made an extra effort to follow the course and learn to code in R. He obtained a management internship at a Toronto hospital, which a few months later matured into a full-time job offer. The jubilant former student sent me an email. Here is what he wrote:

I just got a job as a Senior Decision Support Consultant at Trillium Health Partners for a six-figure salary! And the secret was data-mining!

I never gave up on learning R! I was able to create multiple projects using ML and NLP working in my previous job ... And what I have done stunned the employer, and they gave me the job two days after I showed them my codes.

I just want to say thank you for introducing me to this wonderfully fruitful and interesting field.

Isn’t it wonderful when it really works!

STORYTELLING AT GOOGLE AND WALMART

The key differentiator of this book is the concerted focus on turning data-driven insights into powerful narratives or stories that would grab the attention of all stakeholders and compel them to listen. This is not to say that statistical analysis and programming are not important. They indeed are. However, these are certainly not sufficient for a successful data scientist. Without storytelling abilities, you might be a computer scientist or a statistician, but not a data scientist.

Tom Davenport, a best-selling author of books on analytics, believes in telling stories with data. In a recent posting, he lists five reasons to explain why analytics-based stories are important and four reasons why so many organizations either do it badly or not at all.

I reproduce verbatim his five reasons for why storytelling is paramount:17

1. Stories have always been effective tools to transmit human experience; those that involve data and analysis are just relatively recent versions of them. Narrative is the way we simplify and make sense of a complex world. It supplies context, insight, interpretation—all the things that make data meaningful and analytics more relevant and interesting.

2. With analytics, your goal is normally to change how someone makes a decision or takes an action. You’re attempting to persuade, inspire trust, and lead change with these powerful tools. No matter how impressive your analysis is, or how high-quality your data, you’re not going to compel change unless the stakeholders for your work understand what you have done. That may require a visual story or a narrative one, but it does require a story.

3. Most people can’t understand the details of analytics, but they do want evidence of analysis and data. Stories that incorporate data and analytics are more convincing than those based on anecdotes or personal experience. Perhaps the most compelling stories of all are those that combine data and analytics, and a point of view or example that involves real people and organizations.

4. Data preparation and analysis often take time, but we need shorthand representations of those activities for those who are spectators or beneficiaries of them. It would be time-consuming and boring to share all the details of a quantitative analysis with stakeholders. Analysts need to find a way to deliver the salient findings from an analysis in a brief, snappy way. Stories fit the bill.

5. As with other types of stories, there are only a few basic types; couching our analytical activities in stories can help to standardize communications about them and spread results. It has been argued that there are only seven basic plots in all of literature. I once argued that there are ten types of analytical stories. Regardless of the number, if an organization is clear on the different types of stories that can be told with data and analytics, it makes it more likely that analysts will explore different types over time. Most importantly, the story repertoire should go well beyond basic “here’s what happened” reporting stories.

Mr. Davenport believes that most quantitative analysts are poor storytellers to begin with. They are introverts, more comfortable with machines and numbers than with humans. They are not taught storytelling at school and are encouraged to focus even more on empirical disciplines. Some analysts might consider storytelling a less worthy task than coding. Developing and telling a good story takes time, which analysts may not be willing to spare.

Storytelling is equally important to the biggest big data firm in the world. “Google has a very data-led culture. But we care just as much about the storytelling...” said Lorraine Twohill, who served as Google’s senior vice president of global marketing in 2014.18 Twohill believes “getting the storytelling right—and having the substance and the authenticity in the storytelling—is as respected internally as [is] the return and the impact.”

In fact, Ms. Twohill sees storytelling gaining even more importance in the world of sophisticated tools. She is concerned that the data science world is too narrowly focused on “data” and “science,” and overlooking the primary objective of the entire exercise: storytelling. She warns, “if you fail on the messaging and storytelling, all that those tools will get you are a lot of bad impressions.” I couldn’t have said it any better myself.

If the world's largest big data company is singing the praises of storytelling, so is the world's largest retailer. Mandar Thakur is a senior recruiter for IT talent at Walmart. His job is to hunt for data science talent to meet the growing needs of Walmart, which has used analytics and logistics as its competitive advantage. "The Kaggle competition created a buzz about Walmart and our analytics organization. People always knew that Walmart generates and has a lot of data, but the best part was that this let people see how we are using it strategically," Mr. Thakur told Forbes in 2015 (Marr, 2015).

He also believes communication and storytelling are key to a data scientist's success. Again, this does not mean that analytics and algorithms don't matter. Certainly, data science competency is a must, as Mr. Thakur puts it: "we need people who are absolute data geeks: people who love data, and can slice it, dice it and make it do what they want it to do." He sees communication and presentation skills as the great differentiator. "...[T]here is one very important aspect we look for, which perhaps differentiates a data analyst from other technologists. It exponentially improves their career prospects if they can match this technical, data-geek knowledge with great communication and presentation skills," said Mr. Thakur.

And if you're still not convinced about the importance of storytelling for a data scientist, let us consult America's chief data scientist, D.J. Patil. Yes, this position does exist, and it is held by Dr. D.J. Patil at the White House Office of Science and Technology Policy.19 Dr. Patil told the Guardian newspaper in 2012 that a "data scientist is that unique blend of skills that can both unlock the insights of data and tell a fantastic story via the data."20 Ain't I glad to see storytelling mentioned by Dr. Patil as a key characteristic of data science?

I find it quite surprising that even when the world’s largest big data firm and the world’s largest retailer define data science in terms of storytelling capabilities, we still see the data science discourse burdened with the jargon of algorithms, methods, and tools.

In summary, I am not arguing that programming and statistical analysis are redundant. In fact, I echo Davenport and Patil (2012), who define the ability to code as the data scientist's "most basic, universal skill." However, as programming languages evolve to become increasingly convenient to learn, the focus will shift to data scientists' ability to communicate. "More enduring will be the need for data scientists to communicate in language that all their stakeholders understand—and to demonstrate the special skills involved in storytelling with data, whether verbally, visually, or—ideally—both," wrote Davenport and Patil (2012).

GETTING STARTED WITH DATA SCIENCE

Let me formally introduce you to this book. Imagine a three-way cross between Freakonomics, Statistics For Dummies, and a stats software manual, and you will get Getting Started with Data Science (GSDS). This book offers hands-on training in analytics and is targeted at the two million managers and analysts expected to be proficient users of analytics for big data problems.21 The secondary audience for GSDS is the 1.2 million undergraduate students in business and management programs, 260,000 MBA students, and 173,000 graduate students in research-oriented master's degrees enrolled in the faculties of business and management.22 The tertiary audience for this book is the community of data-oriented practitioners and researchers who would like to do more than just basic tabulations in their applied work.

The success of Freakonomics and Malcolm Gladwell's several excellent books offers pertinent insights. First, the success of pop-economics texts reveals an appetite for non-fiction that introduces economics and management concepts to the general reader. Second, readers can relate to even advanced economics concepts, provided the narrative is powerful and within the grasp of the average reader. Third, in the age of iPads and iPhones, the printed book still matters.

GSDS is the survival manual for researchers, students, and knowledge workers who have been given the mandate to turn data into gold, but who lack adequate training in analytics to harness the data and computing power of their firms. GSDS offers the applied utility of a software manual, but adopts a lucid, storytelling style in which stories are woven with data.

Do We Need Another Book on Analytics?

A short answer to the question is yes. This is not to suggest that quality texts do not already exist. The real issue is that most books on statistics and econometrics are written for students, not practitioners. This also frustrates Hadley Wickham, an uber data scientist and a professor who is revolutionizing data science with his innovations programmed in R.23 He believes that the emergence of data science as a field suggests "a colossal failure of statistics." Dr. Wickham warns about the "total disconnect between what people need to actually understand data and what was being taught."

Similar frustration is expressed by the MIT economist and author Joshua Angrist and by Jorn-Steffen Pischke, a professor at the London School of Economics. Angrist and Pischke are the authors of Mastering 'Metrics, which offers a fresh take on problem-solving analytics.24

In an essay for the World Economic Forum, Angrist and Pischke wonder aloud about "what's the use of econometrics...at least as currently taught?" Just as academic statistics has lost touch with real-life applications, econometrics in academic settings has become an inert exercise in parameter estimation with no real connection to the real world. "It's both remarkable and regrettable, therefore, that econometrics classes continue to transmit an abstract body of knowledge that's largely irrelevant for economic policy analysis, business problems, and even for much of the econometric research undertaken by scholars," wrote Angrist and Pischke.25

This criticism of out-of-touch econometrics is not new. The authors cite earlier work by Becker and Greene (2001), who have been equally critical of the way econometrics is taught to undergraduate students at universities. Becker and Greene observed that econometrics texts focused primarily on "presenting and explaining theory and technical details with secondary attention given to applications, which are often manufactured to fit the procedure at hand."26

The applications, Becker and Greene argued, “are rarely based on events reported in financial newspapers, business magazines or scholarly journals in economics.”

No one can accuse GSDS of the same crime. This book is filled with hands-on examples of the challenges faced by businesses, governments, and societies. The rise of radical Islam in South and West Asia, income inequality in Toronto, teaching evaluations in Texas, commuting times in New York, and religiosity and extramarital affairs are all examples that make GSDS resonate with what is current and critical today.

As I mentioned earlier, in a world awash with data, abundant and ubiquitous computing power, and state-of-the-art algorithms for analytics, a new scarce commodity has emerged: analysts and data scientists. Despite the ready availability of inexpensive tools and data, businesses continue to struggle to turn their data into insights. The challenge is to hire new talent in analytics and to train and repurpose the existing workforce in analytics. This is easier said than done due to less-than-desirable analytics proficiency in the existing workforce and lack of analytics training opportunities in the higher education sector, which is trying to play catch-up with the recent advances in computing, big data, and analytics. Until this occurs, the demand for advanced users of analytics is likely to remain unmet.

Most books on analytics are written for university coursework and are thus not suited for industry professionals. In fact, there is no shortage of excellent books on the theory behind statistics, econometrics, and research methods. There is, however, one major problem: These books are geared to academic settings where the authors are trying to teach skills that they would like to see in their students, some of whom they need to employ as research assistants. These books, therefore, do not appreciate the constraints faced by industry professionals interested in learning about applied analytics who may not have the luxury to spend a semester or a year acquiring basic skills.

The following are some of the unique features of this book.

Repeat, Repeat, Repeat, and Simplify

Often, books on analytics and statistics do not repeat the discussion of methods and techniques. Most topics are illustrated just once, and it is assumed that the reader will grasp the concept by following a single illustration. This is a big assumption on the part of most authors because it fails to appreciate how humans learn: they learn by repeating the task until they get it right. Just watch an infant learn a new word.

Deb Roy, an associate professor at MIT, proved this point with his newborn son by recording every moment of his son's life from birth.27 Dr. Roy recorded the "birth of a word," a feat never accomplished before. Two important insights emerged from the process. First, the child made a tremendous effort to learn a new word by constantly repeating it over weeks, if not months. The child's effort demonstrates the key aspect of learning: repetition, until you get it right.

The second key finding about learning a new word was about the child’s caregivers, who simplified their complex language to a structure that was conducive for the child to learn the new word. Every time the child added a new word to his vocabulary, the caregivers, without consciously being aware of their own transformation, modified their language structure and put more effort in enunciating the key word, such as water, until the child learned the word.

Learning analytics, I believe, is no different from learning a new language, and hence analytics should be learned and taught the same way. The text should repeat the key concepts several times so that the learner has adequate opportunity to grasp the new concept. In addition, the concepts should be taught without jargon, using simple language and powerful imagery with the intent to meet the learner halfway at his or her level.

GSDS embraces these key learning principles and repeats the key techniques, tabulations, graphs, and simple statistical analyses numerous times in the text, using examples and imagery to which most readers can relate. The goal is to assist adult learners, who are professionals in their own right, in developing skills in analytics so that they may be more productive for their organizations.

Thus, GSDS aims to serve the learning needs of millions of workers who have some basic understanding and knowledge of data and analytics, but are interested in taking leadership roles in the fast-evolving landscape of analytics. GSDS’s hands-on approach will empower those who have done some work in analytics, but have the desire to learn more to pursue senior-level opportunities.

Chapters’ Structure and Features

Chapters 1, 2, and 3 establish the foundation for the book. A book on data science cannot be complete without a detailed discussion of what data is, its types, and where one can find it.28 Chapter 2 serves this purpose. My decision to devote a complete chapter to data should be indicative of how important it is to get data "right."

Chapter 3 reinforces the primary message in this book about using data to build strong narratives. I illustrate this by sharing several examples of storytelling using data and analytics. The remaining chapters offer hands-on training in analytics and adopt the following structure.

Chapters 4 and higher are structured to tell one or more stories. Chapter 4, for instance, tells the story of the rise of Brazil and its adoption of the Internet in communication technologies. Chapter 5 tells the stories of the Titanic, and how teaching evaluations might be influenced by instructors’ looks. Chapter 7 starts with a discussion of smoking and its adverse impacts on society. The underlying theme throughout these chapters is that analytic methods, algorithms, and statistical tools are discussed not in an abstract way, but in a proper socioeconomic context so that you might be able to see how the story is being told.

Chapters 4 and higher first introduce the problems needing answers. I then advance some hypotheses to test using the tools and methods introduced in the chapter. For each method, I first explain the problem to solve, what tools are needed, what the desired answers are, and in the end, what story to tell.

Data is central to data science, and also to this book. I provide ready access to the data sets used in each chapter so that you can follow and repeat the analysis presented in this book. Most chapters use more than one data set to illustrate concepts and tell stories from analytics. Even in chapters focused on advanced statistical and analytics methods, I start with simple tabulations and graphics to provide the intuitive foundations for the advanced analytics work. For Chapters 4 and higher, I repeat this strategy to ensure that you get the hang of why simple tools and insightful graphics are just as important as advanced analytical tools.

Each chapter works its way from simple analytic tools to advanced methods to find answers to the questions I pose earlier in the chapter. In most cases, the final sections test the hypothesis stated earlier in the chapter. In some chapters, I illustrate how to tell a story by reproducing the analytics-driven blogs/essays I have published earlier to answer questions that concern our society.

I have made a deliberate effort to find interesting puzzles to solve using analytics. I chose data sets for their relevance and nuance value to keep you interested in the subject. Using data, this book addresses several interesting questions. Following is a brief sample of the questions I have answered using data and analytics.

• Are religious individuals more or less likely to have extramarital affairs?

• Do attractive professors get better teaching evaluations?

• What motivates one to start smoking?

• What determines housing prices more: lot size or the number of bedrooms?

• How do teenagers and older people differ in the way they use social media?

• Who is more likely to use online dating services?

• Why do some people purchase iPhones and others Blackberries?

• Does the presence of children influence a family’s spending on alcohol?

Analytics Software Used

Unlike most texts in statistics and analytics, which settle on one or the other software, GSDS illustrates all concepts and procedures using the most frequently used analytics software, namely R, SPSS, Stata, and SAS.

The very purpose of using multiple platforms for GSDS is to make it attractive to readers who might have invested in learning a particular software in the past, and who would like to continue using the same platform in the future.

I have demonstrated mostly Stata and R in the textbook. Some chapters demonstrate analytics with SPSS. I have not demonstrated the same statistical concepts in all four software packages in the book. This has helped me keep the book to a manageable size. However, this should not deter the reader who might not find her favorite software illustrated in the book. The book is accompanied by a live website (www.ibmpressbooks.com/title/9780133991024) that offers hundreds of additional pages and software code to illustrate every method from the book in R, SPSS, Stata, and SAS. In fact, I encourage readers to develop the same resources in other excellent software, such as Eviews, LimDep, and Python, and share them with me so that I may make them available on the book's website. Of course, we will respect your intellectual property and credit your effort.

The website offers the following resources to give the applied reader hands-on experience in R, SPSS, Stata, and SAS. The files are sorted by chapter.

• The data sets in proprietary format for the four computing platforms

• Individual script files (codes) for each software

• A PDF file showing the output from each software

With these resources at hand, the reader can repeat the analysis illustrated in the book in the computing platform of their choice.

WHAT MAKES SOMEONE A DATA SCIENTIST?

Now that you know what is in the book, it is time to put down some definitions. Despite their ubiquitous use, consensus evades the notions of big data and data science. The question, “who is a data scientist?” is very much alive and being contested by individuals, some of whom are merely interested in protecting their discipline or academic turfs. In this section, I attempt to address these controversies and explain why a narrowly construed definition of either big data or data science will result in excluding hundreds of thousands of individuals who have recently turned to the emerging field.

“Everybody loves a data scientist,” wrote Simon Rogers (2012) in the Guardian. Mr. Rogers also traced the newfound love for number crunching to a quote by Google’s Hal Varian, who declared that “the sexy job in the next ten years will be statisticians.”

Whereas Hal Varian named statisticians sexy, it is widely believed that what he really meant were data scientists. This raises several important questions:

• What is data science?

• How does it differ from statistics?

• What makes someone a data scientist?

In the times of big data, a question as simple as, “What is data science?” can result in many answers. In some cases, the diversity of opinion on these answers borders on hostility.

I define a data scientist as someone who finds solutions to problems by analyzing big or small data using appropriate tools and then tells stories to communicate her findings to the relevant stakeholders. I do not use data size as a restrictive clause. Working with data below a certain arbitrary threshold does not make one less of a data scientist. Nor is my definition of a data scientist restricted to particular analytic tools, such as machine learning. As long as one has a curious mind, fluency in analytics, and the ability to communicate the findings, I consider the person a data scientist.

I define data science as something that data scientists do. Years ago, as an engineering student at the University of Toronto, I was stuck with the question: What is engineering? I wrote my master's thesis on forecasting housing prices and my doctoral dissertation on forecasting homebuilders' choices related to what they build, when they build, and where they build new housing. In the civil engineering department, others were working on designing buildings, bridges, and tunnels, and worrying about the stability of slopes. My work, and that of my supervisor, was not your traditional garden-variety engineering. Obviously, I was repeatedly asked by others whether my research was indeed engineering.

When I shared these concerns with my doctoral supervisor, Professor Eric Miller, he had a laugh. Dr. Miller spent a lifetime researching urban land use and transportation, and had earlier earned a doctorate from MIT. "Engineering is what engineers do," he responded. Over the next 17 years, I realized the wisdom in his statement. You first become an engineer by obtaining a degree and then registering with the local professional body that regulates the engineering profession. Now you are an engineer. You can dig tunnels, write software code, or design components of an iPhone or a supersonic jet. You are an engineer. And when you lead the global response to a financial crisis in your role as the chief economist of the International Monetary Fund (IMF), as Dr. Raghuram Rajan did, you are an engineer.

Professor Raghuram Rajan earned his first degree in electrical engineering from the Indian Institute of Technology. He pursued economics in graduate studies, later became a professor at a prestigious university, and eventually landed at the IMF. He is currently serving as the 23rd Governor of the Reserve Bank of India. Could someone argue that his intellectual prowess is rooted only in his training as an economist and that the fundamentals he learned as an engineering student played no role in developing his problem-solving abilities?

Professor Rajan is an engineer. So are Xi Jinping, the President of the People’s Republic of China, and Alexis Tsipras, the Greek Prime Minister who is forcing the world to rethink the fundamentals of global economics. They might not be designing new circuitry, distillation equipment, or bridges, but they are helping build better societies and economies and there can be no better definition of engineering and engineers—that is, individuals dedicated to building better economies and societies.

So briefly, I would argue that data science is what data scientists do.

Others have much different definitions. In September 2015, a co-panelist at a meetup organized by BigDataUniversity.com in Toronto confined data science to machine learning. There you have it. If you are not using the black boxes that make up machine learning, as per some experts in the field, you are not a data scientist. Even if you were to discover the cure to a disease threatening the lives of millions, turf-protecting colleagues will exclude you from the data science club.

Dr. Vincent Granville (2014), an author on data science, offers certain thresholds to meet to be a data scientist.29 On pages 8 and 9 in Developing Analytic Talent Dr. Granville describes the new data science professor as a non-tenured instructor at a non-traditional university, who publishes research results in online blogs, does not waste time writing grants, works from home, and earns more money than the traditional tenured professors. Suffice it to say that the thriving academic community of data scientists might disagree with Dr. Granville.

Dr. Granville uses restrictions on data size and methods to define what data science is. He defines a data scientist as one who can “easily process a 50-million-row data set in a couple of hours,” and who distrusts (statistical) models. He distinguishes data science from statistics. Yet he lists algebra, calculus, and training in probability and statistics as necessary background “to understand data science” (page 4).

Some believe that big data is merely about crossing a certain threshold on data size or the number of observations, or is about the use of a particular tool, such as Hadoop. Such arbitrary thresholds on data size are problematic because, with innovation, even regular computers and off-the-shelf software have begun to manipulate very large data sets. Stata, a software package commonly used by data scientists and statisticians, announced that one could now process between 2 billion and 24.4 billion rows using its desktop solutions. If Hadoop is the password to the big data club, then Stata's ability to process 24.4 billion rows, under certain limitations, has just gatecrashed that big data party.30

It is important to realize that one who tries to set arbitrary thresholds to exclude others is likely to run into inconsistencies. The goal should be to define data science in a more inclusive, discipline- and platform-independent, size-free context where data-centric problem solving and the ability to weave strong narratives take center stage.

Given the controversy, I would rather consult others to see how they describe a data scientist. Why don't we again consult the Chief Data Scientist of the United States? Recall Dr. Patil told the Guardian newspaper in 2012 that a "data scientist is that unique blend of skills that can both unlock the insights of data and tell a fantastic story via the data." What is admirable about Dr. Patil's definition is that it is inclusive of individuals of various academic backgrounds and training, and does not restrict the definition of a data scientist to a particular tool or subject it to a certain arbitrary minimum threshold of data size.

The other key ingredient for a successful data scientist is a behavioral trait: curiosity. A data scientist has to be one with a very curious mind, willing to spend significant time and effort to explore her hunches. In journalism, editors call it having a nose for news. Not all reporters know where the news lies; only those with a nose for news get the story. Curiosity is as important for data scientists as it is for journalists.

Rachel Schutt is the Chief Data Scientist at News Corp. She teaches a data science course at Columbia University. She is also the author of an excellent book, Doing Data Science. In an interview with the New York Times, Dr. Schutt defined a data scientist as someone who is part computer scientist, part software engineer, and part statistician (Miller, 2013). But that’s the definition of an average data scientist. “The best,” she contended, “tend to be really curious people, thinkers who ask good questions and are O.K. dealing with unstructured situations and trying to find structure in them.”

Existential Angst of a Data Scientist

Statisticians, wrote Rob Hyndman, “seem to go through regular periods of existential crisis as they worry about other groups of people who do data analysis.”31 Rob Hyndman is no ordinary statistician. He is a professor of econometrics and business statistics at Monash University in Australia. He is also the author of highly regarded texts on time series forecasting (http://robjhyndman.com/publications/). In addition, he is the lead programmer for several packages in R including the one on time series forecasting (http://robjhyndman.com/publications/software/). In a blog posted in December 2014, Professor Hyndman addresses the question: “Am I a data scientist?”

He explains how statisticians are concerned that individuals in other disciplines—he mentions computer science, and I would add those in the faculty of engineering—are actively engaged in analyzing data. Shouldn’t statisticians be the only ones working with data? After all, their entire training is focused on how best to analyze data.

Unlike some statisticians, Professor Hyndman is not worried about others stepping onto his academic turf. In fact, he is welcoming of those who have embraced analytics. He wrote:

The different perspectives are all about inclusiveness. If we treat statistics as a narrow discipline, fitting models to data, and studying the properties of those models, then statistics is in trouble. But if we treat what we do as a broad discipline involving data analysis and understanding uncertainty, then the future is incredibly bright.

I find parallels between my definition of data science, which is data science is something that data scientists do, and that of Professor Hyndman. “I am a data scientist because I do data analysis, and I do research on the methodology of data analysis,” he wrote. He explains that he brings statistical theory and modelling to the data science table; others might approach data science with equally valuable skills attained in different disciplines, including but not limited to, computer science. He urges that we must adopt a team perspective on data science. I wholeheartedly agree.

Professor Hyndman gives examples from the medical profession. We are used to going to general practitioners for common ailments. However, when we deal with serious health challenges that may affect a particular part of our physiology, we turn to specialists; for example, cardiologists, nephrologists, and neurosurgeons. We understand and appreciate that it is beyond the capacity of a single individual to have all the expertise needed to deal with the entire rubric of health-related complexities one may face. Thus, there is no one doctor who specializes in cardiology, nephrology, and neurosurgery. The odds of finding such a specialist are more remote than those of finding a unicorn.

Data Scientists: Rarer Than Unicorns

If you believe a data scientist is one who excels in computer programming, statistics, and econometrics, possesses subject matter expertise, and is an exceptional storyteller, then you must realize that you are in fact describing the equivalent of a unicorn. Jeanne Harris, the co-author of Competing on Analytics, and Ray Eitel-Porter, Managing Director of Accenture Analytics in the UK, know all too well the scarcity of talent when it comes to data scientists. Writing in the Guardian, they explained the frustration of executives who believe that finding the perfect data scientist is perhaps as rare as finding a unicorn.32

Describing the perfect data scientist as a unicorn is quite common now, even though many still might not understand the extent of expertise required across a variety of disciplines to make one a unicorn. In fact, in September 2015, I spoke at a data science meet-up sponsored by IBM. When one of my co-panelists asked those in the audience to raise their hands if they considered themselves statisticians, only a couple of people in a crowd of about a hundred did. What I found surprising was that no fewer than five individuals raised their hands when we asked the unicorns to identify themselves. Obviously, we all think very highly of our own limited skills.

I trace the origin of the term unicorn to describe a perfect data scientist to Steve Geringer, a U.S.-based machine-learning consultant.33 Mr. Geringer, in a blog post in January 2014, presented a Venn diagram to highlight the diversity of skills required in a good data scientist (Figure 1.2). He depicts data science as an all-encompassing discipline that brings under its tent the diverse fields of computer science, math and statistics, and subject matter expertise.

Image


Figure 1.2 Making of a data scientist in a Venn diagram

Figure 1.2 presents a near-consensus definition of data science: it is not merely confined to a particular discipline, but in fact lies at the intersection of several disciplines that contribute to our ability to collect, store, and analyze data. What is interesting in Figure 1.2 is the depiction of subsets of data science, which emerge at the intersection of two or more disciplines. For instance, an individual who specializes in computer science and has a profound understanding of statistical theory is the one involved in machine learning. An individual who specializes in computer science and possesses subject matter expertise in a particular discipline, for instance, health sciences, is engaged in the traditional trade of software development. Similarly, an individual with fluency in mathematics and statistics combined with subject matter expertise is the one engaged in traditional research. An example of traditional researchers is epidemiologists, who have a profound understanding of the science of disease and are fluent in statistical analysis.

The magic really happens when we combine the three diverse sets of skills—that is, computer science, statistical analysis, and subject matter expertise. The fusion of these three skills gives birth to unicorns. Rob Hyndman, D. J. Patil (Davenport and Patil, 2012), Nate Silver (FiveThirtyEight.com), and Hadley Wickham (Kopf, 2015) are all unicorns.


BEYOND THE BIG DATA HYPE

Big data makes only a guest appearance in this book. Today, even those who cannot differentiate between standard deviation and standard error are singing the praises of big data and the promise it holds. I must confess I have not yet drunk the Kool-Aid. This is not to argue that I am in denial of big data. Quite the contrary. Like thousands of others who have worked with large data sets over the past several decades, I know that big data is a relative term, more of a moving target, and that it has always been around. What has changed today is the exquisite marketing around big data by the likes of IBM, SAS, and others in the analytics world, which has created mass awareness of what data can do for businesses, governments, and other entities.

The slick marketing has done a great favor to data science. It has converted skeptics by the millions. It is no secret that most university graduates leave with an awkward feeling about their statistics courses. Some describe it as an invasive, yet unavoidable, medical examination that left their minds and bodies a little sore. Only in the geek-dominated domains of computer science, economics, engineering, and statistics does one find individuals enthralled by data and statistical analysis. The sustained marketing campaign about big data analytics, which IBM and others have run across the globe in publications of high repute since 2012, has made big data a household name.

The downside of this enthusiastic embrace of big data is the misconception that big data did not exist before. That would be a false conclusion. In fact, big data has always been around: any data set that exceeded the storage capacity of its day was, for all practical purposes, big data. Given the massive increase in our capacity to hold and manipulate data over the past decade, what was big data yesterday is not big data today, and will certainly not be regarded as big data tomorrow—hence the notion of big data being a moving target.

Dr. Amir Gandomi, who was a postdoctoral fellow with me at Ryerson University, and I wrote a paper on the hype around big data, in which we took it upon ourselves to describe various definitions of big data and the analytics deployed to analyze structured and unstructured data sets.34 Our primary assertion in the paper was the following: when it comes to big data, it's not the size that matters, but how you use it. "Size is the first, and at times, the only dimension that leaps out at the mention of big data. This paper attempts to offer a broader definition of big data that captures its other unique and defining characteristics," we wrote.

Because our paper on big data is available through open access, I decided not to repeat the discussion in this book, but instead encourage readers to consult the journal directly. Thus, you may find big data playing only a minor role in this book. You will also notice that almost all data sets used in this book are small. I expect the critics to have a field day with this feature of the book. Why on earth would someone use small data at a time when the world is awash with big data?

The answer is simple. It is quite likely that by the time you read this book, the definition of big data will have evolved. More importantly, this book is intended to be the very first step on one's journey to becoming a data scientist. The fundamental concepts explained in this book are not affected by the size of data. Small data sets are likely to be less intimidating for those embarking on this brave new journey of data science, and the same concepts that apply to small data can readily be applied to big data.

Big Data: Beyond Cheerleading

Big data coverage in the academic and popular press is ubiquitous, but it often ends up telling the same three stories: how big data helped Google predict flu trends faster than the government, how UPS saved millions of miles with advanced algorithms, and how the retailer Target became aware of a teenager's pregnancy before her father did. If one were to take off the cheerleading uniform and the blinders, one would realize that neither big data nor advanced analytics was behind these so-called big data miracles.

Let me begin with Google's flu trends. Writing in the New York Times, Gary Marcus and Ernest Davis explained the big problems with big data.35 They described how, in 2009, Google claimed to have a better understanding of flu trends than the Centers for Disease Control and Prevention. Google relied on flu-related searches to pick up on the incidence of flu across the United States. Of late, however, Google Flu Trends has been making more bad predictions than good ones. In fact, a 2014 paper in the journal Science revealed that Google's algorithm was predicting roughly twice as many doctor visits for influenza-like illness as were actually recorded.36 The authors attributed the erroneous Google forecasts to "big data hubris" and to algorithm dynamics in social media, which change over time and make it difficult to generate consistent results. "'Big data hubris,'" they explained, "is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis."

Big Data Hubris

Another example of big data hubris dates back to 1936, when Alfred Landon, a Republican, was contesting the American presidential election against F.D. Roosevelt. Long before the term big data was coined, a publication decided to engage in a big data exercise of its time. Tim Harford explains the story in the New York Times.37 The Literary Digest, a respected publication, decided to survey 10 million individuals, one-fourth of the electorate, about their choice of presidential candidate. The Digest compiled the 2.4 million responses received in the mail and claimed that Alfred Landon would win by a landslide, predicting that Landon would receive 55% of the vote to Roosevelt's 41%.

On election day, F.D. Roosevelt won by a landslide, securing 61% of the votes. His Republican opponent could muster only 37%. While the Literary Digest was busy compiling its 2.4 million responses, George Gallup, a pollster, conducted a much smaller survey of a few thousand voters and forecast Roosevelt's victory by a comfortable margin. "Mr. Gallup understood something that the Literary Digest did not. When it comes to data, size isn't everything," wrote Mr. Harford.

What happened was that the Literary Digest posed the right question to the wrong (unrepresentative) sample, probably the elites who subscribed to the publication. The elites favored Landon, and the Digest duly captured that preference in its survey. The much smaller yet representative sample of 3,000 individuals polled by Mr. Gallup, however, revealed the true intent of the electorate, which favored President Roosevelt.
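
To see why a small representative sample can beat a gigantic biased one, consider the following toy simulation in R. The electorate size, the true 61% support for Roosevelt, and the response rates are illustrative assumptions, not the Literary Digest's actual polling mechanics.

# Toy simulation: a huge but biased sample versus a small random one.
# All numbers below are illustrative assumptions.
set.seed(1936)

electorate <- 10e6                     # a notional electorate of 10 million
truth <- rbinom(electorate, 1, 0.61)   # 1 = votes Roosevelt (true share 61%)

# Biased "big data" poll: Roosevelt supporters are assumed less likely
# to be reached or to reply, mimicking a skewed mailing list
reply_prob  <- ifelse(truth == 1, 0.20, 0.40)
respondents <- which(runif(electorate) < reply_prob)
digest_poll <- sample(respondents, 2.4e6)
mean(truth[digest_poll])               # well below the true 61%

# Small but representative poll of 3,000 voters drawn at random
gallup_poll <- sample(electorate, 3000)
mean(truth[gallup_poll])               # lands close to the true 61%

In this sketch, the 2.4-million-strong biased sample puts Roosevelt in the mid-40s, while the 3,000-person random sample typically lands within a point or two of the true 61%. Size did not save the biased poll; representativeness saved the small one.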

I must point out that regardless of how sophisticated a method or technique is, human error and hubris can cause major embarrassment for those involved in forecasting. A few years later, Mr. Gallup made the wrong call in the 1948 presidential election when he forecast that Republican Thomas Dewey would defeat Harry Truman.

Gallup blamed the error on the decision to stop polling weeks before the November 2 election. During that period, Harry Truman was busy crisscrossing the nation campaigning on a platform of civil rights. Millions attended Truman’s speeches and became convinced of his ideals. The other limiting factor was that polling was often done by telephone, and telephones were more common among Dewey’s voters than among Truman’s.

Decades later, we again saw pollsters and news media forecast victory for yet another Republican candidate (George W. Bush), whereas the final vote count was much less definitive.

Leading by Miles

United Parcel Service (UPS) is a global leader in logistics and express delivery services. Its fleet of thousands of trucks, boats, and airplanes helps make deliveries all across the globe. UPS made big news with big data when it ceremoniously announced that by using big data and advanced analytics, its logistics experts were able to shave off millions of miles from its vehicles’ itineraries. This meant millions of dollars in savings.

In an interview with Bloomberg Business, David Barnes, Chief Information Officer at UPS, claimed that they “were using big data to drive smarter.”38

Although the CIO at UPS believes the company is harnessing the power of big data, the team that implemented those advanced algorithms was not convinced that it was indeed big data that delivered the dividends. Jack Levis is the UPS Senior Director of Process Management. He avoids the term big data. His view is that UPS has been working with terabytes of data to deliver millions of packages since the early ’90s, long before the term big data became popular. To Mr. Levis, “Big data is a ‘how’; it’s not the ‘what.’ The ‘what’ is big insight and big impact, and if you do that through big data, great.”39

UPS indeed was able to have a significant impact on its operations by deploying advanced algorithms to reduce the number of miles traveled by its fleet of nearly 100,000 vehicles. Engineers at UPS have been working on innovative solutions to the “traveling salesman problem,” which refers to minimizing the total transportation effort for an individual or a fleet of vehicles by devising cost-minimizing routes. They “reduced 85 million miles driven a year. That’s 8.5 million gallons of fuel that we’re not buying and 85,000 metric tons of CO2 not going in the atmosphere.” Although these are substantial improvements in business operations, they certainly are not the result of unique algorithms. You see, I know a thing or two about routing algorithms; I have spent considerable time implementing them since the mid-nineties. I worked with TransCAD, a software package designed to build large and complex models that forecast travel times and traffic volumes on a street network. I developed one such model for the greater Montréal area when I taught in the faculty of engineering at McGill University. The model predicted travel times and volumes on the 135,000 bidirectional links that comprised Montreal’s regional street network.40

Off-the-shelf software packages like TransCAD and EMME/2 offered the functionality to minimize travel times, distances, or costs for a fleet operating on a large network long before UPS made the “traveling salesman problem” famous. Just as IBM turned big data into a media sensation, UPS brought route assignment algorithms to the attention of the popular press.
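For readers who have not encountered the traveling salesman problem before, the following is a minimal sketch of the classic nearest-neighbor heuristic in Python. The delivery stops and coordinates are made up for illustration; this is a textbook heuristic, not the proprietary routing algorithm that UPS or TransCAD actually implements.

# Nearest-neighbor heuristic for a tiny traveling salesman problem.
# Hypothetical delivery stops given as (x, y) coordinates in kilometers.
import math

stops = {"depot": (0, 0), "A": (2, 6), "B": (5, 2), "C": (6, 6), "D": (8, 3)}

def dist(p, q):
    # Straight-line distance between two points
    return math.hypot(p[0] - q[0], p[1] - q[1])

def nearest_neighbor_route(points, start="depot"):
    # Greedily visit the closest unvisited stop, then return to the start.
    unvisited = set(points) - {start}
    route, current = [start], start
    while unvisited:
        nxt = min(unvisited, key=lambda s: dist(points[current], points[s]))
        route.append(nxt)
        unvisited.remove(nxt)
        current = nxt
    route.append(start)           # close the loop back at the depot
    return route

route = nearest_neighbor_route(stops)
total = sum(dist(stops[a], stops[b]) for a, b in zip(route, route[1:]))
print(" -> ".join(route), f"({total:.1f} km)")

Production routing engines add many layers on top of this greedy idea, such as time windows, vehicle capacities, and traffic forecasts, but the underlying objective of minimizing total travel effort is the same.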

Predicting Pregnancies, Missing Abortions

The poster child of big data and predictive analytics stories is a long piece in the New York Times by Charles Duhigg, who is also the author of an engaging book, The Power of Habit. Mr. Duhigg profiles Andrew Pole, a data scientist at Target who developed a model to identify potentially pregnant customers by reviewing their past purchases. Based on the analysis of past purchases, Target’s researchers observed that a customer purchasing calcium, magnesium, and zinc supplements had a high probability of being in the first 20 weeks of pregnancy. Armed with this insight, the marketing team mailed coupons for related items to expectant mothers.41

This was all fine and dandy until one day a father walked into a Target store in Minneapolis angrily complaining about the coupons for cribs and diapers sent to his teenage daughter. Mr. Duhigg reported that the same father called the store weeks later to apologize because, unbeknownst to him, his daughter was indeed pregnant.

Charles Duhigg’s account of big data and predictive analytics making retailers aware of a young woman’s pregnancy before her father has created a false image of what data and analytics can accomplish. Harford (2014) explains the obvious pitfalls in such glorified portrayals of analytics. For starters, getting one pregnancy correctly predicted says nothing about the false positives; that is, the number of customers who were incorrectly identified as being pregnant.
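A quick back-of-the-envelope calculation shows why false positives matter. All the numbers below are assumptions chosen purely for illustration, since Target has never disclosed its model’s accuracy; the point is that even a seemingly accurate model flags far more non-pregnant customers than pregnant ones when the condition it predicts is rare.

# Why one correct prediction says little: a sketch of the false-positive
# problem with made-up numbers (base rate, sensitivity, and false-positive
# rate are all assumptions for illustration).
customers = 1_000_000        # shoppers scored by the model
base_rate = 0.02             # assume 2% are actually in early pregnancy
sensitivity = 0.80           # assume the model flags 80% of true cases
false_positive_rate = 0.05   # assume it also flags 5% of everyone else

true_positives = customers * base_rate * sensitivity
false_positives = customers * (1 - base_rate) * false_positive_rate
precision = true_positives / (true_positives + false_positives)

print(f"Correctly flagged: {true_positives:,.0f}")
print(f"Incorrectly flagged: {false_positives:,.0f}")
print(f"Share of flagged customers actually pregnant: {precision:.0%}")

Under these assumed numbers, roughly three out of every four customers who receive the baby coupons are not pregnant at all. One dramatic anecdote about a correct hit tells us nothing about that crowd of misses.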

Let’s assume for a second that predictive analytics empower retailers to be so precise in forecasting that they could predict pregnancies by reviewing customer purchases. Such powerful analytics should contribute to the profitability of retailers and help them minimize losses. The reality, however, is quite different.

I find it quite surprising that Target had the capability to predict a teenager’s pregnancy, yet it failed to see its operations being aborted in Canada. In January 2015, Target packed up its operations in Canada and took a $5.4 billion hit on its balance sheet. Target entered the Canadian market with much fanfare in March 2013 and opened 133 stores across Canada. In fewer than two years, the retailer called it quits after struggling to keep its shelves stocked and customers happy. The multibillion-dollar losses also had a human face. Target’s 17,000 Canadian employees lost their livelihood.42

I believe Target serves as an excellent example of not falling for the hype. Vendors who specialize in big data products will spare no effort to project big data and analytics as the panacea for all our problems. At the same time, journalists and others in their naivety repeat the so-called success stories of big data and analytics without applying a critical lens to them. What happened to Target could happen to any other firm regardless of the size of its databases and the sophistication of its predictive algorithms. Remember, no amount of digital sophistication is a substitute for adequate planning and execution.

WHAT’S BEYOND THIS BOOK?

Is it a good time to be involved in data science and analytics? The discussion presented so far in this chapter offers a clear answer: Yes. In fact, some would argue that it might already be a little too late to get involved in this emerging field. I would contend that the sooner you start, the better.

In fact, at the St. Luke Elementary School in Mississauga, Ontario, students as young as four are being exposed to the fundamentals of data science. On a recent visit to the school, I was pleasantly surprised to see that students, even in junior kindergarten, were being taught to summarize data in graphics. As you will see in Chapter 5 of this book, data visualization is data science’s most powerful weapon for communicating findings. In a very creative way, children were learning to make pictographs, a pictorial version of the bar charts used in statistics (see Figure 1.3).

Figure 1.3 Kindergarten students making bar charts in Canada

Source: Photographed by the author on February 19, 2015, at the St. Luke Elementary School in Mississauga, Ontario

Young students were asked about how they arrived at school every day. If they arrived by car, they were asked to paste a picture of a car on a chart. If they arrived by bus, they pasted a picture of a bus, and those who walked to school pasted a picture of a young boy walking. The result was a bar chart clearly depicting the frequency distribution of the mode of travel to school for the 27 students enrolled in the kindergarten class.
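Readers who want to replicate the kindergarten exercise on a computer can do so in a few lines of Python. The class size of 27 comes from the text; the split across travel modes below is hypothetical, since the exact counts shown in Figure 1.3 are not reported.

# A bar chart of how students travel to school, mirroring the pictograph
# exercise. The counts are a made-up breakdown that sums to 27 students.
import matplotlib.pyplot as plt

modes = ["Car", "Bus", "Walk"]
counts = [12, 9, 6]            # hypothetical split across the 27 students

plt.bar(modes, counts)
plt.title("How we get to school (27 students)")
plt.ylabel("Number of students")
plt.show()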

I have taught statistics and research methods for over 15 years now. I have seen even senior undergraduate students struggling with presenting their information in simple bar charts. If the four-year-olds at St. Luke can generate bar charts of such quality today, imagine their potential 20 years down the road.

Those interested in switching to data science and adopting it as a career would like to know where, and to whom, they might turn for training in data science and analytics. Given the vast interest in data science, universities and colleges are struggling to keep up with the demand for such training. In fact, they are merely catching up to the demand, launching mostly graduate programs in data science. As I write these words in Fall 2015, only a handful of undergraduate degrees in data science have surfaced in North America and Europe. The University of San Francisco, for instance, offers an undergraduate major in data science.43 Similarly, the University of Nottingham in the UK offers a BSc in data science.44 The web portal Data Science 101 offers a partial list of undergraduate programs in data science.45

Because this book is primarily targeted at those who have already completed an undergraduate degree, or are in the process of completing one, I am inclined to offer information on programs that are more suitable for such cohorts. Essentially, I believe there are two paths of learning. The first is a structured path where one can pursue further education either full-time or part-time in a graduate program with a focus on data science. Another web portal, datascience.community, offers a comprehensive list of graduate and undergraduate programs majoring in data science and analytics.46

At the same time, most colleges and universities offer similar training through their continuing education divisions, which are better suited for those who are already employed and would like to take evening courses to enhance their skills. Individuals who lack the discipline for self-directed study are strongly urged to take the structured route of further training at a university.

Those who are disciplined enough to follow a curriculum without supervision have a plethora of options available through MOOCs. Specialized platforms dedicated to online learning, such as Coursera and Udacity, are offering free as well as paid courses in a variety of fields including data science and analytics. Julia Stiglitz heads business development at Coursera, which has more than 13 million registered users. She revealed that data science is one of the most popular programs at Coursera. “To satisfy the demand ... means going beyond the traditional walls of higher education,” said Ms. Stiglitz.47

Numerous universities have also joined the fray by offering free online courses. Some universities offer academic credits for paid versions of the same courses. Increasingly large corporations involved in data science and predictive analytics have also started to offer online training. BigDataUniversity.com is an IBM initiative with the objective to train one million individuals in data science.

SUMMARY

Data science and analytics have taken the world by storm. Newspapers and television broadcasts are filled with praise of big data and analytics. Business executives and government leaders never tire of praising big data and explaining how they have embraced it to improve their respective bottom lines. The hype around big data has created demand for individuals skilled in analytics and data. Stories of six-figure starting salaries for data scientists are feeding the craze.

In this introductory chapter, I set out to explain why data science and analytics will continue to be the rage for the foreseeable future. As we capture increasing amounts of data, the need to analyze the data and turn it into actionable insights is felt ever more keenly. Firms are climbing over each other to compete for the scarce talent in data science.

Whereas most of the discourse about data science and analytics focuses on algorithms and data size, I see storytelling as an integral part of the data science portfolio. I explained that the ability to program and the knowledge of statistical principles remain the core capabilities of data scientists. However, to excel in this field, one must have the ability to turn insights into convincing narratives.

I have also introduced the differentiating features of this book. Each chapter focuses on one or more puzzles, which I solve using data and analytics. Methods are not randomly introduced. Instead, the nature of the question or puzzle dictates the methods I deploy. The book repeats the primary messages so that the reader is able to register them. This book also illustrates statistical methods using a variety of statistical software and products, thus allowing readers to learn using their favorite tool.

I have also presented a detailed discussion of what data science is and who a data scientist is. Consensus on these important definitions, it appears, remains elusive. Instead of defining data science as the application of a particular tool, or as an exercise involving data of a certain size or beyond, I define a data scientist as one who uses data and analytics to find solutions to the challenges she or her organization faces, and who is able to present her findings in a convincing narrative.

Lastly, I discuss the hype that surrounds big data, offering a critical review of the claims made in the popular press about what big data and predictive analytics can achieve. I also introduce readers to options for pursuing training in data science and analytics, both in traditional academic settings and through MOOCs offered by dedicated online portals as well as specialized firms, such as IBM and Microsoft.

ENDNOTES

1. Haider, M. and Akbar, A. (2015, September 18). “Army captain among 20 killed in TTP-claimed attack on PAF camp in Peshawar.” Retrieved September 18, 2015, from http://www.dawn.com/news/1207710.

2. Platt, Marc; Kristie Macosko Krieger; and Steven Spielberg (Producers); and Steven Spielberg (Director). (2015). Bridge of Spies [Motion Picture]. United States: DreamWorks Studios. Goetzman, G. (Producer) and Nichols, M. (Director). (2007). Charlie Wilson’s War [Motion Picture]. United States: Universal Pictures.

3. http://www.satp.org/

4. Haider, M. (2011, August 31). “Exploring the fault lines.” Retrieved September 18, 2015, from http://www.dawn.com/2011/08/31/exploring-the-fault-lines/.

5. Davenport, T. H. and Harris, J. G. (2007). Competing on Analytics: The New Science of Winning (1st edition). Harvard Business Review Press.

6. Miller, C. C. (2013, April 11). “Data Science: The Numbers of Our Lives.” The New York Times. Retrieved from http://www.nytimes.com/2013/04/14/education/edlife/universities-offer-courses-in-a-hot-new-field-data-science.html.

7. Davenport, T. H. and Patil, D. J. (2012). “Data scientist: the sexiest job of the 21st century.” Harvard Business Review, 90(10), 70–6, 128.

8. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., and Byers, A. (2011, May). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute San Francisco.

9. Duhigg, C. (2012, February 16). “How Companies Learn Your Secrets.” New York Times. Retrieved from http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html.

10. Murray, S. (2015, May 11). “MBA Careers: Leaders Need Big Data Analytics to Rise to C-Suite.” Retrieved September 18, 2015, from http://www.businessbecause.com/news/mba-careers/3248/leaders-need-analytics-to-progress.

11. Ibid.

12. Marr, B. (2015, July 6). “Walmart: The Big Data Skills Crisis and Recruiting Analytics Talent.” Forbes. Retrieved September 19, 2015, from http://www.forbes.com/sites/bernardmarr/2015/07/06/walmart-the-big-data-skills-crisis-and-recruiting-analytics-talent/.

13. Miller, C. C. “Data Science: The Numbers of Our Lives.”

14. Lohr, S. (2015, July 28). “As Tech Booms, Workers Turn to Coding for Career Change.” New York Times. Retrieved from http://www.nytimes.com/2015/07/29/technology/code-academy-as-career-game-changer.html.

15. “Education at Galvanize—Programming, Data Science, & More.” (n.d.). Retrieved September 19, 2015, from http://www.galvanize.com/courses/#.Vfzxt_lVhBe.

16. Miller, C. C. “Data Science: The Numbers of Our Lives.”

17. Davenport, T. (2015, February 10). “Why Data Storytelling is So Important, and Why We’re So Bad at It.” Retrieved September 19, 2015, from http://deloitte.wsj.com/cio/2015/02/10/why-data-storytelling-is-so-important-and-why-were-so-bad-at-it/?mod=wsjrc_hp_deloitte.

18. Gordon, J. and Twohill, L. (2015, February). “How Google breaks through.” Retrieved September 19, 2015, from http://www.mckinsey.com/insights/marketing_sales/how_google_breaks_through?cid=other-eml-nsl-mip-mck-oth-1503.

19. D.J. Patil | LinkedIn. (n.d.). Retrieved September 21, 2015, from https://www.linkedin.com/in/dpatil.

20. Rogers, S. (2012, March 2). “What is a data scientist?” Guardian.

21. http://www.mckinsey.com/features/big_data

22. http://www.aacsb.edu/dataandresearch/dataglance/enrollment_total.html

23. Kopf, D. (2015, July 24). “Hadley Wickham, the Man Who Revolutionized R.” Retrieved September 19, 2015, from http://priceonomics.com/hadley-wickham-the-man-who-revolutionized-r/.

24. Angrist, J. D. and Pischke, J.S. (2014). Mastering ‘Metrics: The Path from Cause to Effect. Princeton University Press.

25. Angrist, J. and Pischke, J.S. (2015, May 21). “Why econometrics teaching needs an overhaul.” Retrieved September 19, 2015, from https://agenda.weforum.org/2015/05/why-econometrics-teaching-needs-an-overhaul/?utm_content=buffer98041&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer.

26. Becker, W. E. and Greene, W. H. (2001). “Teaching Statistics and Econometrics to Undergraduates.” The Journal of Economic Perspectives: A Journal of the American Economic Association, 15(4), 169–182.

27. Roy, D. (2011, March). “The birth of a word.” Retrieved September 19, 2015, from http://www.ted.com/talks/deb_roy_the_birth_of_a_word.

28. It is unfortunate to lose the distinction between datum, which is a singular noun, and data, which is plural. Data is increasingly used in a singular context, and because this usage has taken hold, I surrender on this front and treat data as a singular noun.

29. Granville, V. (2014). Developing Analytic Talent: Becoming a Data Scientist. John Wiley & Sons.

30. “More than 2 billion observations.” (n.d.). Retrieved September 21, 2015, from http://www.stata.com/new-in-stata/huge-datasets/.

31. Hyndman, R. J. (2014, December 9). “Am I a data scientist?” Retrieved September 21, 2015, from http://robjhyndman.com/hyndsight/am-i-a-data-scientist/.

32. Harris, J. G. and Eitel-Porter, R. (2015, February 12). “Data scientists: As rare as unicorns.” Guardian.

33. Geringer, S. (2014, January 6). “Steve’s Machine Learning Blog: Data Science Venn Diagram v2.0.” Retrieved September 21, 2015, from http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html.

34. Gandomi, A. and Haider, M. (2015). “Beyond the hype: Big data concepts, methods, and analytics.” International Journal of Information Management, 35(2), 137–144. http://dx.doi.org/10.1016/j.ijinfomgt.2014.10.007.

35. Marcus, G. and Davis, E. (2014, April 6). “Eight (No, Nine!) Problems with Big Data.” New York Times. Retrieved from http://www.nytimes.com/2014/04/07/opinion/eight-no-nine-problems-with-big-data.html.

36. Lazer, D., Kennedy, R., King, G., and Vespignani, A. (2014). “The parable of Google Flu: traps in big data analysis.” Science, 343 (14 March). Retrieved from http://scholar.harvard.edu/files/gking/files/0314policyforumff.pdf.

37. Harford, T. (2014, March 28). “Big data: are we making a big mistake?” FT.com. Retrieved September 21, 2015, from http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html.

38. Schlangenstein, M. (2013, October 30). “UPS Crunches Data to Make Routes More Efficient, Save Gas.” Retrieved September 21, 2015, from http://www.bloomberg.com/news/articles/2013-10-30/ups-uses-big-data-to-make-routes-more-efficient-save-gas.

39. Dix, J. (2014, December 1). “How UPS uses analytics to drive down costs (and no, it doesn’t call it big data).” Retrieved September 21, 2015, from http://tinyurl.com/upsbigdata.

40. Spurr, T. and Haider, M. (2005). “Developing a GIS-based detailed traffic simulation model for the Montreal region: Opportunities and challenges.” In Canadian Transport Research Forum Annual Proceedings—Old Foundations, Modern Challenges. CTRF. Retrieved from http://milute.mcgill.ca/Research/Students/Tim_Spurr/Tim_paper_CTRF.pdf.

41. Duhigg, C. (2012, February 16). “How Companies Learn Your Secrets.” New York Times. Retrieved from http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html.

42. Ho, B. S. (2015, January 15). “In surprise move, Target exits Canada and takes $5.4 billion loss.” Retrieved September 21, 2015, from http://www.reuters.com/article/2015/01/15/us-target-canada-idUSKBN0KO1HR20150115.

43. https://www.usfca.edu/arts-sciences/undergraduate-programs/data-science

44. http://www.nottingham.ac.uk/ugstudy/courses/computerscience/bsc-data-science.aspx

45. http://101.datascience.community/2013/08/21/undergraduate-programs-in-data-science/

46. http://datascience.community/colleges

47. Murray, S. (2015, May 11). “MBA Careers: Leaders Need Big Data Analytics To Rise To C-Suite.” Retrieved September 21, 2015, from http://www.businessbecause.com/news/mba-careers/3248/leaders-need-analytics-to-progress.

Chapter 3. The Deliverable

What makes one a data scientist? If you were to read any other book on the subject, it would likely speak of someone who specializes in data collection, storage, and manipulation. Some even argue about whether data science is really a “science.” Others focus on computer engineering, arguing that the engineering aspects of data management override the science aspects, or vice versa. Others define data science as being all about software and coding. Some focus on the size of the data, and others on what to do with the data. Some say data science is all about forecasting and predictive analytics; others contend that it has more to do with data mining. Seldom will you find anyone stressing the importance of the narrative.

This book differs from the others because it focuses on the narrative derived from data and analysis. The final deliverable, I argue, is not merely a deck of PowerPoint slides with tables and graphics. The final deliverable is the narrative or the story that the data scientist tells from the insights presented as graphs and tables. A data scientist, I contend, should be as concerned with the narrative as with data and analytics.

The earliest mention of the phrase data science in the news media is that of the firm Mohawk Data Science Corp. in the New York Times in April 1969. The company, founded in 1964, was one of the pioneers in digitizing documents. Initially, it introduced magnetic tapes to replace punch cards for data storage and transmission.1 As for data scientist, the first mention I find is in the News Gazette in Champaign, Illinois. The November 2000 story in fact announced the wedding plans of Jeanette Sanders, who was then employed as a data scientist with Pfizer Pharmaceuticals in Groton, Connecticut.

Interestingly, the roots of the terms data science and data scientist take us back to a paper by Bryce and others (2001) who were of the view that the term data scientist might be more appropriate than statistician for the computer and data-centric role of statisticians. Writing in the American Statistician,2 they observed: “The terms ‘data scientist’ or ‘data specialist’ were suggested as perhaps more accurate descriptions of what should be desired in an undergraduate statistics degree.”

Regardless of the origins of the term, some still question whether data science is really “a science.” It is quite likely that the term data science evolved in the traditional disciplines, such as statistics and computer science. Bryce and others (2001) offer an insight into the thinking of those structuring the undergraduate curricula in statistics in the early 2000s as they married data and computer science to come up with data science. The debate about how different data science is from traditional statistics will continue for some time.

My definition is radically different from others who view data scientists in the narrow context of algorithms and software code. I believe a data scientist is a storyteller who weaves narrative from bits and bytes, thus presenting a comprehensive argument to support the evidence-based strategic planning, which is essential in a data-rich world equally awash with analytics.

I am not alone in thinking of data scientist as a storyteller. Tom Davenport, a thought leader in data and analytics, explains why storytelling is important for data scientists.3 He observes that despite the “compelling reasons for the importance of stories, most quantitative analysts are not very good at creating or telling them. The implications of this are profound—it means analytical initiatives don’t have the impact on decisions and actions they should. It means time and money spent on acquiring and managing data and analyzing it are effectively wasted.”

Ignoring the importance of narrative restricts the role of a data scientist to that of a computer scientist or an engineer who cherishes the opportunity to improve computing by devising better algorithms. A data scientist, I argue, is more than that. By having control over, and fluency in, developing the narrative, a data scientist ensures that the intermediate fruits of her labor, that is, the analytics, are nurtured into a final deliverable that serves the decision-making needs of the stakeholders.

By being a storyteller, a data scientist has a better chance of being at the forefront of decision-making. Having one’s byline on the final deliverable further ensures that one gets credit for one’s hard work. An analyst might toil through the inhospitable terrain of raw data sets, clean the data, conduct analyses, and summarize results as graphs and tables. Yet the analyst hands over the fruits of her labor to a colleague or a superior, who builds a narrative from the findings, presents it to the stakeholders, and becomes the subject of all praise and glory. A preferred approach is for the analyst to complete the task by offering the final deliverable complete with the narrative, rather than a litany of tables and graphs.

Hal Varian, Google’s chief economist, believes that careers in data analytics will be the most coveted in the near future. Dr. Varian is not referring to computer scientists or statisticians per se. I believe he is referring to those who will effectively turn data into insights. As governments and businesses are able to archive larger amounts of data, the need to turn data into insights will be felt evermore. This will not necessarily generate demand for those who will create new algorithms. Instead, it will increasingly generate demand for those who will apply the existing algorithms to extract insights from data.

Tim O’Reilly is the founder of O’Reilly Media. Mr. O’Reilly devised a list of the top seven data scientists and ranked Larry Page, the cofounder of Google (now Alphabet), at the top. There is no doubt that Mr. Page has had a much larger impact on computing and analytics than almost anyone else. Google has effectively transformed the way we search for and analyze information. However, Mr. Page’s contributions are more in line with those of a brilliant computer scientist. Larry Page earned a bachelor’s degree in computer engineering and a master’s in computer science. Computer science was in Larry Page’s genes: his father, Dr. Carl Vincent Page Sr., was a professor of computer science at Michigan State University. It is not hard to imagine that future incarnations of Larry Page are also likely to come from a computer science background.

In a 2011 report, the McKinsey Global Institute claimed that the “United States alone faces a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts to analyze big data and make decisions based on their findings.” I believe that technological innovations will make size largely irrelevant to analytics. This has happened numerous times in the past. The advances in computer engineering and computer science, including the work done by the likes of Google, will make it possible for others to deploy analytics on “large” data sets with success and little effort. Consider running searches on Google: you are already playing with big data whenever you run a query. The real challenge in analytics is to staff the enterprise with data-savvy analysts and managers who can deploy the available tools and algorithms for knowledge and insight generation.

I believe that Larry Page is one of the most influential computer scientists of our time. I also believe that a world awash with data will require even more data scientists, and not necessarily computer scientists, who can turn existing data and other resources into actionable insights and strategies. What data scientists need is not necessarily a background in computer science with advanced programming skills in C, Spark, Hadoop, Java, or Python, but a passion for analytics, a curious mind, and the ability to tell a data-driven story.

In this chapter, I focus on the deliverable. The final step in analytics involves communicating the findings to the intended audiences. The communication might take the form of summary tables, figures, and other similar products, which must be complemented with a narrative. I believe that a good data scientist builds a comprehensive road map before she embarks on the journey. What questions need to be answered, what resources are available, and what stories need to be told are the questions a data scientist asks about the final deliverable before embarking on analytics.

This chapter is organized as follows. First, I define the ingredients and requirements of the final deliverable. Second, I explain how to search for background information. Third, I describe the tools needed to generate the deliverable. Finally, I reproduce standalone research briefs I have published earlier as syndicated blogs. The research briefs serve as examples of how data scientists can use data and analytics to tell compelling stories. Each research brief is a standalone piece of between 750 and 2,000 words that uses data to weave a narrative. The briefs discuss a broad range of topics related to human and urban development in Canada, the United States, and South Asia. These pieces were published by the Dawn newspaper in Pakistan, Global News in Canada, and the Huffington Post.

THE FINAL DELIVERABLE

The ultimate purpose of analytics is to communicate findings to the concerned stakeholders, who might use these insights to formulate policy or strategy. Analytics summarize findings in tables and plots. The data scientist should then use those insights to build a narrative that communicates the findings. In academia, the final deliverable takes the form of essays and reports. Such deliverables are usually 1,000 to 7,000 words in length.

In consulting and business, the final deliverable takes on several forms. It can be a small document of fewer than 1,500 words illustrated with tables and plots, or it could be a comprehensive document comprising several hundred pages. Large consulting firms, such as McKinsey4 and Deloitte,5 routinely generate analytics-driven reports to communicate their findings and, in the process, establish their expertise in specific knowledge domains.

Let’s review the United States Economic Forecast, a publication by the Deloitte University Press.6 This document serves as a good example of a deliverable that builds a narrative from data and analytics. The 24-page report focuses on the state of the U.S. economy as observed in December 2014. The report opens with a “grabber” highlighting the fact that, contrary to popular perception, economic and job growth has been quite robust in the United States. The report is not merely a statement of facts. In fact, it is a carefully crafted document that cites Voltaire and follows a distinct theme. The report focuses on the “good news” about the U.S. economy, including increased investment in manufacturing equipment and the likelihood of higher consumer spending resulting from lower oil prices.

The Deloitte report uses time series plots to illustrate trends in markets.7 The GDP growth chart shows how the economy contracted during the Great Recession and has rebounded since then. The graphic presents four likely scenarios for the future. Another plot shows the changes in consumer spending. The accompanying narrative focuses on income inequality in the U.S. and refers to Thomas Piketty’s book on the subject.8 The Deloitte report mentions that many consumers did not experience an increase in their real incomes over the years, even as they maintained their level of spending. Other graphics focus on housing, the business and government sectors, international trade, labor and financial markets, and prices. The appendix carries four tables documenting the data behind the four scenarios discussed in the report.

Deloitte’s United States Economic Forecast serves the very purpose that its authors intended. The report uses data and analytics to generate the likely economic scenarios. It builds a powerful narrative in support of the thesis statement that the U.S. economy is doing much better than what most would like to believe. At the same time, the report shows Deloitte to be a competent firm capable of analyzing economic data and prescribing strategies to cope with the economic challenges.

Now consider if we were to exclude the narrative from this report and present the findings as a deck of PowerPoint slides with eight graphics and four tables. The slides would fail to communicate the message that the authors carefully crafted in the report, citing Piketty and Voltaire. I consider Deloitte’s report a good example of storytelling with data and encourage you to read it to decide for yourself whether the deliverable would have been equally powerful without the narrative.

Now let us work backward from the Deloitte report. Before the authors started their analysis, they must have discussed the scope of the final deliverable. They would have deliberated the key message of the report and then looked for the data and analytics they needed to make their case. The initial planning and conceptualizing of the final deliverable is therefore extremely important for producing a compelling document. Embarking on analytics, without due consideration to the final deliverable, is likely to result in a poor-quality document where the analytics and narrative would struggle to blend.

I would like you to focus on the following considerations before embarking on analytics.

What Is the Research Question?

The key to good analytics is to understand the research question. All analytics are conducted to gain insights into a research question. A better understanding of the research question ensures that the subsequent actions are likely to bear fruit. It is also important to recognize that advanced analytics are not a panacea for a poorly structured research question.

For example, assume that you are the senior director for human resources at a large firm. You have been asked to review a grievance filed by women workers who contend that the firm pays them less than what it pays men. Your goal is to investigate whether a gender bias exists in compensation.

Over the years, and almost all across the world, women have been known to earn, on average, less than men. Several factors contribute to the wage gap between men and women. The choice of employment sectors, labor market preferences for certain skills, and differences in qualifications are some of the reasons cited for the wage gap.9 Labor market studies suggest that the wage gap between men and women has narrowed. Many socially responsible businesses are committed to eliminating the structural barriers and biases that promote such gaps. Yet much more work still needs to be done.

Earlier research has helped identify wage gaps and their determinants. As an analyst, you might be asked to run a basic query on the average compensation for each department broken down by gender. This would be appropriate to answer the question: Does compensation differ for men and women employees in the firm? If your goal is to investigate allegations of gender bias, your research question should be more nuanced than what you have selected.

Gender bias in this particular case implies that women employees are paid less for doing the same job when no difference exists in experience, education, productivity, and other attributes of the work done by men and women. The research question thus has to account for relevant attributes. A more appropriate research question is stated as follows: Given the same education, experience, and productivity levels, does the firm compensate women less than men? You might want to include other considerations, such as whether the firm promotes women to leadership positions as frequently as it promotes men. You might also want to explore the race dimension to see whether the gender bias is more pronounced for racialized women.

What Answers Are Needed?

After you have refined the research question, you need to think about the answers or solutions that you will have to present to the stakeholders. Think not merely about the format of the final report, but also about the content in the answers you will present. Building on the gender bias example, the executives might want your report to offer robust evidence that either refutes or confirms the allegations. Let us assume that your analysis reached a rather fuzzy conclusion that suggested the gender bias might have taken place in some circumstances but not in others. The bias might be apparent in compensation, but the differences are not statistically significant, a concept I later introduce in Chapter 6, “Hypothetically Speaking.”

You need to know in advance whether the executives at the firm would be satisfied with an answer that falls short of being a smoking gun. If you are expected to produce a robust response, then you must choose those methods and data that will help you reach a conclusive answer. In fact, you must always select the most appropriate and robust methods to answer the questions. However, you must also be cognizant of the deadlines. If the right method will take months to collect and analyze data, and your deadline is a week from today, you must find appropriate alternatives to provide an answer that might have limitations, but could guide, and provide the motivation for, further investigations. The analytics should inform and guide the narrative and not the other way around. Remember, integrity is the most important attribute of a data scientist.

At the same time, you should think of the format in which the answer must be presented to the executives. If they are expecting a report, you should be prepared to draft a formal report. If they are expecting a PowerPoint presentation, then you should produce one that will help you convey your findings.

How Have Others Researched the Same Question in the Past?

After you have sharpened your research question, you might want to see how others in the past have researched the same or similar question. In academia, this stage is referred to as a literature review. In business, this is called background research. Consider the gender bias example I discussed earlier. You should consult the academic and professional literature to see how others have explored the same question in the past. This enables you to benefit from the experience of other researchers.

Sources of Information

Google Scholar is arguably the best place to begin your search for related research. Unlike other indexes, discussed later in the chapter, Google Scholar archives publications from academic and other sources. In addition, Google Scholar searches the entire text of the publication and not just its abstract and title. Google Scholar reported 208,000 items when I searched for gender bias and wages.10 I refined the search term by enclosing gender bias in quotes. This reduced the search to 21,200 records, still an unrealistically large number of publications to review. One can select the apparently more relevant publications that are usually listed first by Google Scholar. Otherwise, one can restrict the search to the title of the publication to get a shorter list of publications. A search for gender bias and wages restricted to the title of publications returned only six documents.

Academics have an “unfair” advantage over others in searching for information. Most universities subscribe to custom indexes that archive information by topic. Such services are prohibitively expensive for small firms to subscribe to, and even more so for individual researchers. However, there is hope. Public libraries often subscribe to some of the commonly used indexes. Otherwise, you might want to subscribe to the library at your alma mater, which will allow you to access, even remotely, the digital resources the university subscribes to. Remember, universities often offer discounted subscriptions to alums. Also, try the library at your local university for a guest membership.

Assuming that you have secured access to the resources commonly found at an institute of higher learning, you will have a rich choice of tools to play with. I want to recommend a few key resources that can make your life as a data scientist easier. I restrict the discussion to resources more relevant to business and the social sciences. If you work in engineering, law, or medicine, you might have to explore other domain-specific resources.

In business and social science research, news coverage of events offers great insights. The events could be socio-political, economic, or environmental. The news reports record history in real time. For instance, the news coverage of an oil spill would carry details about the saga as it unfolds, how different actors involved in the containment react, and what methods or techniques are used for containment.

Factiva and LexisNexis are the two most powerful tools to search the entire text of newspapers and other news media from across the world, with some limitations. Whereas Factiva covers only news, the Lexis part of LexisNexis searches through legal documents and judgments.

The biggest limitation for these indexes is language. Most newspapers covered by Factiva and LexisNexis are published in English. If you are researching political outcomes in the Middle East, then the word on the street, the wall chalking, and the most relevant commentary will be in Arabic. Neither Factiva nor LexisNexis will be of much help in knowing what the buzz is in Arabic. However, if you can be content with the coverage in the English press, Factiva and LexisNexis are your best options.

Using Factiva, for instance, you can search for a combination of words in the title, lead paragraph, or the entire text of the news item. You can restrict your search to publications of a particular type, jurisdiction, or an individual source. Factiva also offers unique news pages focused on specific topics where it aggregates recent relevant coverage of the same topic.

I am aware of several good tools to search the academic press. There is still no good source for searching the publicly available but privately generated research. Banks, consulting firms, government agencies, and non-governmental organizations (NGOs) generate hundreds of thousands of reports each year. Unless you already know of a publication, or search the online archives on the respective websites one by one, you are unlikely to stumble upon such reports. At present, Google Scholar is the best source for materials published in non-academic outlets.

Whereas all indexes permit searching for relevant publications and access to abstracts, some indexes offer access to the entire text of the publication. A good place to start is ProQuest Digital Library, which also offers access to the entire text whenever it is available. Other relevant indexes include Web of Science, EconLit, and Springer Link. Web of Science offers a unique feature that permits sorting results by the times a publication has been cited by others. This is in addition to sorting papers by publication date and relevance.

ProQuest Digital Dissertations is another excellent resource for academic research. Master’s theses and doctoral dissertations contain many more details than what is ultimately published in journal publications resulting from graduate student research. Dissertations often include copies of survey instruments that the authors developed for their research. If you need to collect new data, I believe that adapting a survey instrument for your project is far more efficient than developing one from scratch. You can modify the survey to meet your needs by collaborating with the researcher who devised the original survey instrument.

I have presented here only a small subset of resources. Searching for information using digital resources is evolving rapidly. It is quite likely that by the time this book hits the shelves, other players might have surfaced offering search capabilities across academic, non-academic, and emerging blogospheres. We are fortunate to be living in a time when solutions are emerging at a rate faster than the problems we encounter.

Finders Keepers

In a digital world where information flows ceaselessly, storing relevant information as we stumble across it is a challenge. Even when we search formally for information, the volume is so large that archiving research-related information for future retrieval is becoming increasingly difficult. Recall that when I ran a search for gender bias and wages, Google Scholar returned 21,200 records. Let us assume that after browsing through the first few pages, we determined that at least 22 publications were of interest to us. One option is to type out the list of references. Obviously, this exercise would be time-consuming and redundant, because we would be retyping references that are already archived. In an ideal world, we should be able to select the key references of interest from a search engine or index and have them populate our custom database of relevant information.

Fortunately, several freeware and commercial software programs are available for exactly this purpose. The Paperpile service enables you to generate and maintain bibliographies, and import records from Google Scholar and other web-based environments.11 EndNote and Reference Manager dominate the commercially available software market for generating and maintaining bibliographies. Several freeware options are also available, including Zotero, which I would encourage you to consider.

Using Zotero

To see how Zotero or similar software works, you can set up a free account with Zotero. You have the option to work with the browser version that will maintain your references in the cloud, or you can download a desktop version as well. Google Chrome also offers a plug-in for Zotero.

After setting up the account and installing the plug-in into Google Chrome, suppose I conduct a search for “gender bias” (in quotation marks) and “wages” using Google Scholar. I found 21,200 publications that matched my query. The plug-in adds an additional folder symbol on the right side of the web address field. After Google displays the search results, I can click on the folder symbol, which opens another dialog box enabling me to select the references of interest by clicking on them. Clicking OK saves those references in my personal database. I can choose to include accompanying abstracts, and even the PDF versions of those publications that are readily available for download.

After the chosen references are archived in the database, I can search and retrieve them later and even generate automatic bibliographies without ever needing to type the references. As an example, I have included four references from the ones generated by the search command. I logged in to my account at www.zotero.org to locate the recently added references. I selected the APA style to generate the following bibliography. The entire process from search to bibliography took 20 seconds. I have numbered the references, which is not the norm for the APA style:

1. Borooah, V. K. (2004). Gender bias among children in India in their diet and immunization against disease. Social Science & Medicine, 58(9), 1719–1731. Retrieved from www.sciencedirect.com/science/article/pii/S0277953603003423

2. Estevez-Abe, M. (2005). Gender bias in skills and social policies: the varieties of capitalism perspective on sex segregation. Social Politics: International Studies in Gender, State & Society, 12(2), 180–215. Retrieved from sp.oxfordjournals.org/content/12/2/180.short

3. Hultin, M. and Szulkin, R. (1999). Wages and unequal access to organizational power: An empirical test of gender discrimination. Administrative Science Quarterly, 44(3), 453–472. Retrieved from asq.sagepub.com/content/44/3/453.short

4. Kim, M. (1989). Gender bias in compensation structures: a case study of its historical basis and persistence. Journal of Social Issues, 45(4), 39–49. Retrieved from onlinelibrary.wiley.com/doi/10.1111/j.1540-4560.1989.tb02358.x/full

The Need for a Bigger Shoe Box

The ubiquitous nature of the Internet is such that if one is not talking, one is busy browsing the Internet. A long list of Net-enabled devices, including smartphones, tablets, and even some TVs, now competes with the laptops and desktops through which we browsed the Internet in the past. The constant flow of information means that a large enough shoebox is needed to house all the brief and long notes you scribble, the important emails you want to archive, and the full text of the news stories you just read. Evernote is the shoebox12 you are looking for. It is freely available with some restrictions, or for a nominal monthly subscription that costs less than an expensive cold beverage at the local Starbucks.

I have more than 3,000 notes archived in over three dozen notebooks. I can click and save the entire text in the web browser to Evernote. I can even email notes to my Evernote account to be stored in specific notebooks. In addition, and most important of all, my data is updated on my smart phone, tablet, and laptop in real time. If I am on the move, I am moving with my shoebox.

What Information Do You Need to Answer the Question?

A review of relevant research will put you in a better position to determine what information you need to answer the research question. To explore gender bias in wages, for instance, we not only need the details on total compensation received by each employee, but also details such as the employee’s sex, age, experience, education, job classification, and some measure of productivity, to name a few. Let us assume that the firm did not maintain detailed records on employees. This would seriously limit the analyst’s ability to answer the question about gender bias. Remember that no amount of technical savvy can be a substitute for missing or poor information.

If we realize that the information required to answer the research question is not readily available, we need to determine whether we can collect the information internally or ask an external agency to collect it for us. Consider, for instance, that the company interested in investigating the gender bias had collected information about the highest degree obtained by each employee when they first joined the firm. However, the firm’s internal database was not updated for any subsequent improvements in the employees’ education. Over the course of their careers, employees could have gone back to school for part-time graduate studies or diplomas. This is akin to an investment in their respective human capital.

Without this information, the gender bias analysis will be based on dated information on the employees’ education attainment. One way of dealing with this limitation is for the firm to conduct an internal survey, asking the employees to report any additional training, courses, or degrees they might have completed since they started working. With this additional information on education attainment in hand, you can analyze how investment in human capital by the employees might impact their wages. You can also determine the difference between men and women in pursuing higher or technical education opportunities subsequent to their employment with the firm.

At times, the ideal information needed for the analysis might take too long or cost too much to obtain. In such circumstances, the analyst has to make a decision. One option is to postpone the analysis until all relevant information is available. This might not be a viable option because executives need answers, and they need them fast. The other option is to adopt a satisficing approach: proceed with the analysis using what you have, and conclude while adequately stating the limitations of your work.

What Analytical Techniques/Methods Do You Need?

It is important to know in advance what methods or techniques to use for the analysis. This has an impact on the type of data you need for the analysis. Consider our gender bias example where we need to determine whether a gender bias in wages or compensation exists when men and women with similar education, experience, and productivity are employed to do the same job. Given that analysts are expected to hold constant other factors while investigating the impact of gender, it appears that a regression-type model is better suited to answer the research question.13 If I were asked only to compare the difference in average wages for men and women, I would have opted for a t-test.14 I demonstrate the use of t-test and regression analysis in Chapter 6.
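To make the distinction between the two approaches concrete, here is a minimal sketch in Python run on simulated payroll data. The variable names, sample size, and effect sizes are invented for illustration and do not reflect any real firm; the point is simply that the t-test compares raw averages by gender, whereas the regression estimates the gender gap while holding education and experience constant.

# A t-test versus a regression on simulated payroll data (all numbers
# below, including the built-in $4 gap, are assumptions for illustration).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "female": rng.integers(0, 2, n),           # 1 = woman, 0 = man
    "education": rng.integers(12, 21, n),      # years of schooling
    "experience": rng.integers(0, 31, n),      # years on the job
})
df["wage"] = (20 + 2.5 * df["education"] + 0.8 * df["experience"]
              - 4.0 * df["female"] + rng.normal(0, 8, n))

# t-test: do average hourly wages differ by gender?
t, p = stats.ttest_ind(df.loc[df.female == 1, "wage"],
                       df.loc[df.female == 0, "wage"])
print(f"t-test: t = {t:.2f}, p = {p:.4f}")

# Regression: the gender gap holding education and experience constant
model = smf.ols("wage ~ female + education + experience", data=df).fit()
print(model.params.round(2))

On this simulated data, the coefficient on female recovers a gap of roughly four dollars after controlling for education and experience, which is exactly the kind of “other things being equal” answer the nuanced research question demands.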

THE NARRATIVE

The narrative gives life to numbers. Suspended in columns and rows, numbers in tables seldom speak to the reader. The narrative is what calls attention to the gems latent in tables. After sitting through thousands of presentations over the years, I cannot recall a single instance when a presenter showed a table onscreen and it made sense without the presenter explaining what was depicted in it. Numbers slowly start to make sense as the speaker explains what they represent and imply.

The same goes for publications. One needs to read the accompanying text to make sense of the information summarized in tables. I must admit that some exceptionally talented researchers are capable of generating self-explanatory tables. However, they are in a minority. Most analysts have yet to master the art of generating tables that can convey the information without relying on accompanying text.

I believe that even when the tables and charts are well illustrated, the need for a powerful narrative never diminishes. Narrative adds color to facts. It allows you to contextualize the findings. The choice of verbs and metaphors allows you to leave your distinct mark on the findings.

One way to appreciate the power of strong narrative is to compare the coverage of a big event in a local and a national newspaper. Newspapers with a large national subscriber base, such as the New York Times in the U.S. and the Guardian in the UK, have the distinct privilege of working with some of the finest writers and editors. Local newspapers, given their limited resources, are often unable to retain the best talent. Still, they cover the same international or national events. Global thought leaders, who have mastered the craft of creative thinking and writing, contribute opinion pieces to national newspapers. You might want to compare commentary on the same issue from a national and a local newspaper to see how the quality of narrative makes all the difference, even when the subject matter and facts presented are the same.

I would like to share the names of writers I have admired over the years for their ability to communicate ideas with powerful narrative. From the New York Times, I recommend Maureen Dowd, David Brooks, Nicholas Kristof, and Roger Cohen.

Paul Krugman is a professor of economics, a Nobel Laureate, and a blogger and columnist with the New York Times. I think he serves as a much better example of a data scientist for the readers of this book. Professor Krugman illustrates his writings with numbers and graphics (krugman.blogs.nytimes.com/). Writing about the French economy, he shares a time series plot of the yield on French 10-year bonds. While challenging Eugene Fama and others from the "Chicago School" on public sector spending, he graphs the theoretical relationships between the variables of interest. Often, Professor Krugman uses data and charts from FRED (Federal Reserve Economic Data) in his blog posts. I believe he is the ultimate data scientist because he dances with bold ideas, flirts with numbers, and romances the expression.

Another good source of data-driven narrative is the data blogs set up by leading newspapers. The one I recommend is from the Guardian.15 The Guardian has in fact taken data and visualization to a new level. The newspaper even offers courses in data visualization.16 A blog post in December 2014 discussed the soaring rates of imprisonment in Australia. The narrative was supported by customizable graphics showing differences across states and territories.17

The Report Structure

Before starting the analysis, think about the structure of the report. Will it be a brief report of five or fewer pages, or will it be a longer document running more than 100 pages in length? The structure of the report depends on the length of the document. A brief report is more to the point and presents a summary of key findings. A detailed report incrementally builds the argument and contains details about other relevant works, research methodology, data sources, and intermediate findings along with the main results.

I have reviewed reports by leading consultants including Deloitte and McKinsey. I found that the length of the reports varied depending largely on the purpose of the report. Brief reports were drafted as commentaries on current trends and developments that attracted public or media attention. Detailed and comprehensive reports offered a critical review of the subject matter with extensive data analysis and commentary. Often, detailed reports collected new data or interviewed industry experts to answer the research questions.

Even if you expect the report to be brief, sporting five or fewer pages, I recommend that the deliverable follow a prescribed format including the cover page, table of contents, executive summary, detailed contents, acknowledgements, references, and appendices (if needed).

I often find the cover page missing from documents. This is not merely the inexperience of undergraduate students showing in their submissions; doctoral candidates, too, require an explicit reminder to include an informative cover page. I hasten to add that business-world sleuths are hardly any better. Just search the Internet and you will find plenty of reports from reputed firms that are missing a cover page.

At a minimum, the cover page should include the title of the report; the names of the authors, their affiliations, and their contact details; the name of the institutional publisher (if any); and the date of publication. I have seen numerous reports missing the date of publication, making them nearly impossible to cite properly. Also, from a business point of view, authors should make it easy for readers to reach them, and having contact details up front does exactly that.

A table of contents (ToC) is like a map needed for a trip never taken before. You need to have a sense of the journey before embarking on it. A map provides a visual proxy for the actual travel with details about the landmarks that you will pass by in your trip. The ToC with main headings and lists of tables and figures offers a glimpse of what lies ahead in the document. Never shy away from including a ToC, especially if your document, excluding cover page, table of contents, and references, is five or more pages in length.

Even for a short document, I recommend an abstract or an executive summary. Nothing is more powerful than explaining the crux of your arguments in three paragraphs or less. Of course, for larger documents running a few hundred pages, the executive summary could be longer.

An introductory section is always helpful in setting up the problem for the reader who might be new to the topic and who might need to be gently introduced to the subject matter before being immersed in intricate details. A good follow-up to the introductory section is a review of available relevant research on the subject matter. The length of the literature review depends upon how contested the subject matter is. In instances where the vast majority of researchers have concluded in one direction, the literature review can be brief, with citations for only the most influential authors on the subject. On the other hand, if the arguments are more nuanced, with caveats aplenty, then you must cite the relevant research to offer adequate context before you embark on your analysis. You might also use the literature review to highlight gaps in existing knowledge, which your analysis will try to fill. This is where you formally introduce your research questions and hypotheses.

In the methodology section, you introduce the research methods and data sources you used for the analysis. If you have collected new data, explain the data collection exercise in some detail. You will refer to the literature review to bolster your choice for variables, data, and methods and how they will help you answer your research questions.

The results section is where you present your empirical findings. Starting with descriptive statistics (see Chapter 4, “Serving Tables”) and illustrative graphics (see Chapter 5, “Graphic Details” for plots and Chapter 10, “Spatial Data Analytics” for maps), you will move toward formally testing your hypothesis (see Chapter 6, “Hypothetically Speaking”). In case you need to run statistical models, you might turn to regression models (see Chapter 7, “Why Tall Parents Don’t Have Even Taller Children”) or categorical analysis (see Chapters 8, “To Be or Not to Be” and 9, “Categorically Speaking About Categorical Data”). If you are working with time series data, you can turn to Chapter 11, “Doing Serious Time with Time Series.” You can also report results from other empirical techniques that fall under the general rubric of data mining (see Chapter 12, “Data Mining for Gold”). Note that many reports in the business sector present results in a more palatable fashion by holding back the statistical details and relying on illustrative graphics to summarize the results.

The results section is followed by the discussion section, where you craft your main arguments by building on the results you have presented earlier. The discussion section is where you rely on the power of narrative to enable numbers to communicate your thesis to your readers. You refer the reader to the research question and the knowledge gaps you identified earlier. You highlight how your findings provide the ultimate missing piece to the puzzle.

Of course, not all analytics return a smoking gun. At times, more frequently than I would like to acknowledge, the results provide only a partial answer to the question and that, too, with a long list of caveats.

In the conclusion section, you generalize your specific findings and adopt something of a marketing approach to promote your findings so that the reader does not remain stuck in the caveats you voluntarily outlined earlier. You might also identify possible future developments in research and applications that could result from your work.

What remains is housekeeping, including a list of references, the acknowledgement section (acknowledging the support of those who have enabled your work is always good), and appendices, if needed.

Have You Done Your Job as a Writer?

As a data scientist, you are expected to do a thorough analysis with the appropriate data, deploying the appropriate tools. As a writer, you are responsible for communicating your findings to the readers. Transport Policy, a leading research publication in transportation planning, offers a checklist for authors interested in publishing with the journal. The checklist is a series of questions authors are expected to consider before submitting their manuscript to the journal. I believe the checklist is useful for budding data scientists and, therefore, I have reproduced it verbatim for their benefit.

1. Have you told readers, at the outset, what they might gain by reading your paper?

2. Have you made the aim of your work clear?

3. Have you explained the significance of your contribution?

4. Have you set your work in the appropriate context by giving sufficient background (including a complete set of relevant references) to your work?

5. Have you addressed the question of practicality and usefulness?

6. Have you identified future developments that might result from your work?

7. Have you structured your paper in a clear and logical fashion?

BUILDING NARRATIVES WITH DATA

Talking about the deliverable is one thing; seeing it is another. In this section, I reproduce brief reports (averaging around 1,000 words) that rely on data and analytics to reinforce a key message. In fall 2011, I started writing a weekly blog for a newspaper in Pakistan, the Dawn. The syndicated blog focused on the socio-economic and security challenges that continue to stall the growth and development of a 180-million strong nation. Given the digital format of the blog, I included tables and figures in my writings, which were showcased every Wednesday on the newspaper's website, www.dawn.com. Later, I started writing for the Huffington Post and a Canadian news channel, Global TV. The approach, though, remained the same across the three channels: I searched for data and generated tables and figures to illustrate my arguments.

In this section, I reproduce slightly modified versions of some blogs to serve as an example for deliverables that data scientists and analysts are expected to produce. Recall the key message that I repeat in this book: It is not for the lack of analytics or algorithms that data are not effectively turned into insights. Instead, it is the lack of the ability to communicate the findings from analytics to the larger audience, which is not fluent in analytics, that prevents data- and evidence-driven decision-making. A data scientist therefore is the lynchpin that connects raw data to insights. The analytics-driven reports in blog format presented next illustrate how to use data and analytics to communicate the key message.

Here are the key features that are common in the blogs presented:

1. Though the strength of the argument relies on the power of the narrative, graphics and tables bolster the key message.

2. Although the tables and figures are simple and easy to comprehend, which they should be for effective communication, the analytics used to generate them might be more involved and comprehensive.

3. You do not need to generate all analytics by yourself. A good data scientist should have the nose to go after processed insights generated by the larger fraternity. Remember, the goal is not to reinvent the wheel!

The blogs presented here are organized along four key themes that intersect with various dimensions of human development challenges, including urban transport and housing, human development in South Asia, and the trials and tribulations of immigrant workers. I begin, though, with the familiar discourse on big data.

“Big Data, Big Analytics, Big Opportunity”


The world today is awash with data. Corporations, governments, and individuals are busy generating petabytes of data on culture, economy, environment, religion, and society. While data have become abundant and ubiquitous, the data analysts needed to turn raw data into knowledge are in short supply.

With big data comes a big opportunity for the educated middle class in the developing world, where an army of data scientists can be trained to support the offshoring of analytics from western countries, whose needs are unlikely to be met by locally available talent.

The McKinsey Global Institute in a 2011 report revealed that the United States alone faces a shortage of almost 200,000 data analysts.18 The American economy requires an additional 1.5 million managers proficient in decision-making based on insights gained from the analysis of large data sets. Even after Hal Varian,19 Google's famed chief economist, proclaimed that "the real sexy job in 2010s is to be a statistician," there were not many takers for the opportunity in the West, where students pursuing degrees in statistics, engineering, and other empirical fields are few in number and are often visa students from abroad.

A recent report by Statistics Canada revealed that two-thirds of those who graduated with a PhD in engineering from a Canadian University in 2005 spoke neither English nor French as mother tongue. Similarly, four out of 10 PhD graduates in computers, mathematics, and physical sciences did not speak a western language as mother tongue. In addition, more than 60 percent of engineering graduates were visible minorities, suggesting that the supply chain of highly qualified professional talent in Canada, and largely in North America, is already linked to the talent emigrating from China, Egypt, India, Iran, and Pakistan.

The abundance of data and the scarcity of analysts present a unique opportunity for developing countries, which have an abundant supply of highly numerate youth who could be trained and mobilized en masse to write a new chapter in modern-day offshoring. This would require a serious rethink for the thought leaders in developing countries who have not taxed their imaginations beyond dreaming up policies to create sweatshops where youth would undersell their skills and see their potential wilt away while creating undergarments for consumers in the west. The fate of the youth in developing countries need not be restricted to stitching underwear or making cold calls from offshored call-centers in order for them to be part of the global value chains.20 Instead, they can be trained as skilled number crunchers who would add value to the otherwise worthless data for businesses, big and small.

A Multi-Billion Dollar Industry

The past decade has witnessed a major evolution in which some very large manufacturing firms, known in the past mostly for hardware engineering, have transformed into service-oriented firms providing business analytics. Take IBM, which specialized as a computer hardware company producing servers, desktop computers, laptops, and other supporting infrastructure. That was the IBM of yesteryear. Today, IBM is a global leader in analytics. IBM has divested from several hardware initiatives, such as manufacturing laptops, and has instead spent billions in acquisitions to build its analytics credentials. For instance, IBM acquired SPSS for over a billion dollars to capture the retail side of the business analytics market.21 For large commercial ventures, IBM acquired Cognos® to offer full-service analytics. The aggressive acquisitions continue to date.

In 2011 alone, the software market for business analytics was worth over $30 billion.22 Oracle ($6.1 billion), SAP ($4.6 billion), IBM ($4.4 billion), and Microsoft and SAS ($3.3 billion each) led the market in sales. It is estimated that the sale of business analytics software alone will hit $50 billion by 2016. Dan Vesset of IDC, a company known for gauging industry trends, aptly noted that business analytics had "crossed the chasm into the mainstream mass market" and that the "demand for business analytics solutions is exposing the previously minor issue of the shortage of highly skilled IT and analytics staff."

In addition to the bundled software and service sales offered by the likes of Oracle and IBM, business analytics services in the consulting domain generated an additional several billion dollars worldwide. While the large firms command the lion's share of the analytics market, the billions left as crumbs by the industry leaders are still a large enough prize for startups and small firms to take the analytics plunge.

Several Billion Reasons to Hop on the Analytics Bandwagon

While the IBMs of the world focus largely on large corporations, the analytics needs of small and medium-sized enterprises (SMEs) are unlikely to be met by IBM, Oracle, or other large players. Cost is the most important determinant: SMEs prefer to have analytics done on the cheap, while the overheads of the large analytics firms run into millions of dollars, pricing them out of the SME market. With offshoring comes access to affordable talent in developing countries that can bid for smaller contracts and beat the competition in the West on price and, over time, on quality as well.

The trick, therefore, is to coexist alongside the large enterprises by not competing against them. Business analytics is not one market but an amalgamation of several markets focused on delivering value-added services involving data capture, data warehousing, data cleaning, data mining, and data analysis. Developing countries can carve out a niche for themselves by focusing exclusively on contracts that large firms will not bid for because of their large overheads.

Leaving the fight for top dollars in analytics to the top dogs, a cottage industry in analytics could be developed in developing countries to serve the analytics needs of SMEs. Take the example of the Toronto Transit Commission (TTC), Canada's largest public transit agency with annual revenues exceeding a billion dollars. When TTC needed a mid-sized database of almost half a million commuter complaints analyzed, it turned to Ryerson University rather than a large analytics firm.23 TTC's decision to work with Ryerson University was motivated by two considerations. The first was cost: as a public sector university, Ryerson believes strongly in serving the community and offered the services gratis. The second was quality: Ryerson, like most similar institutions of higher learning, excels in analytics, with several faculty members working at the cutting edge who are more than willing to apply their skills to real-life problems.

Why Now?

The timing has never been better to engage in analytics on a very large scale. Innovations in information and communication technology (ICT) and the ready availability of the most advanced analytics software as freeware allow entrepreneurs in developing countries to compete worldwide. The Internet makes it possible to join global marketplaces at negligible cost. With cyber marketplaces such as Kijiji and Craigslist, individuals can become proprietors offering services worldwide.

Using the freely available Google Sites, one can have a business website online immediately at no cost. Google Docs, another free service from Google, allows one to share documents with collaborators or the rest of the world. Other free services, such as Google Trends, allow individual researchers to generate data on business and social trends without needing subscriptions to services that cost millions. For instance, Figure 3.1 was generated using Google Trends and shows daily visits to the websites of leading analytics firms. Without free access to such services, the data used to generate the same graph would carry a huge price tag.


Figure 3.1 Internet traffic to websites of leading analytics firms

Similarly, another free Google service allows one to determine, for instance, which cities registered the highest number of search requests for a term such as business analytics. As of July 2012, four of the top six cities where business analytics was most searched for were located in India, as is evident from Figure 3.2, where search intensity is mapped on a normalized index of 0 to 100.


Figure 3.2 Cities with highest number of searches for big data
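
For readers who want to pull similar data directly into R, here is a minimal sketch using the third-party gtrendsR package, which wraps the Google Trends service. The package, the search term, and the time window are my assumptions for illustration, and the exact fields returned can vary across package versions.

# install.packages("gtrendsR")   # one-time installation
library(gtrendsR)

# Query Google Trends for a single search term over the past 12 months
trends <- gtrends(keyword = "business analytics", time = "today 12-m")

# Search interest over time, on Google's normalized 0-100 index
head(trends$interest_over_time)

# Search interest by city (the kind of breakdown shown in Figure 3.2);
# this element may be empty for terms with sparse regional data
head(trends$interest_by_city)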

The other big development of recent times is freeware that is leveling the playing field between the analytics haves and have-nots. One of the most sophisticated computing platforms in analytics is R,24 which is freely available. Developers worldwide are busy extending the R platform, which now offers over 6,000 packages for analyzing data. From econometrics to operations research, R is fast becoming the lingua franca of statistical computing. R has evolved from being popular just among computing geeks to having its praises sung by the New York Times.25

R has also made some new friends, especially Paul Butler,26 a Canadian undergraduate student who became a worldwide sensation (at least among the geek fraternity) by mapping the geography of Facebook. While interning at Facebook, Paul analyzed petabytes of data to plot how friendships were registered on Facebook. His map (see Figure 3.3) became an instant hit worldwide and has been reproduced in publications thousands of times. You might be wondering what computing platform Paul used to generate the map. Wonder no more: the answer is R.


Figure 3.3 Facebook friendship networks mapped across the globe

R is fast becoming the preferred computing platform for data scientists worldwide (see Figure 3.4). For decades, the data analysis market was ruled by the likes of SAS, SPSS, Stata, and other similar players. Of late, R has captured the imagination of data analysts, who are fast converging on it. In fact, most innovations in statistics are first coded in R so that the algorithms immediately become freely available to all.


Source: r4stats.com/articles/popularity/

Figure 3.4 R gaining popularity over other statistical software

Another advantage of using R is that training materials are widely available on the Internet, including videos on YouTube.27

Where to Next

The private sector has to take the lead for business analytics to take root in emerging economies. Governments could also play a small role in regulation. However, the analytics revolution has to take place not because of the public sector, but in spite of it. Even public sector universities in developing countries cannot be entrusted with the task, because senior university administrators do not warm up to innovative ideas unless they involve a junket in Europe or North America. At the same time, faculty in public sector universities in developing countries are often unwilling to try new technologies.

The private sector in developing countries might first want to launch an industry group that takes on the task of certifying firms and individuals interested in analytics for quality, reliability, and ethical and professional competencies. This will help build confidence in national brands. Without such certification, foreign clients will be apprehensive about sharing their proprietary data with overseas firms.

The private sector will also have to take the lead in training a professional workforce in analytics. Several companies train their employees in the latest technology and then market their skills to clients. The training houses would therefore also double as consulting practices where the best graduates might be retained as consultants.

Small virtual marketplaces could be set up in large cities where clients can post requests for proposals and pre-screened, qualified bidders can compete for contracts. The national self-regulating body might be responsible for screening qualified bidders from its vendor-of-record database, which it would make available to clients globally through the Internet.

Large analytics firms believe the analytics market will hit hundreds of billions in revenue in the next decade. The abundant talent in developing countries can be polished into a skilled workforce to tap into the analytics market to channel some revenue to developing countries while creating gainful employment opportunities for the educated youth who have been reduced to making cold calls from offshored call centers.

Urban Transport and Housing Challenges

In addition to death and taxes, traffic congestion is the third inescapable fact of life for those who live in large cities. The following pieces analyze the challenges faced by road warriors and those who struggle to tap into the prohibitively expensive urban housing markets in the developed world.

“You Don’t Hate Your Commute, You Hate Your Job!”

You don't hate your commute; you hate your job. A Statistics Canada survey revealed that workers who disliked their jobs were much more likely to hate their commutes than those who liked their jobs. Our hatred of the morning commute might be driven by our unsatisfying jobs.

The General Social Survey in 2005, and later in 2010, quizzed thousands of Canadians about their satisfaction with various aspects of their lives, including commuting. At least 64 percent of workers who greatly disliked their jobs were displeased with their commutes.28 On the other hand, only 10 percent of those who greatly liked their jobs reported disliking their commutes.

Extensive surveys of workers in Canada have revealed that our love-hate relationship with daily commutes is much more nuanced than we had believed. Furthermore, public transit riders, who spend 63 percent more time commuting to work than those who commute by car, dislike commuting even more, even when they commute shorter distances than car commuters do. Lastly, commuters who face frequent congestion, and not necessarily longer commutes in terms of distance or duration, report much higher levels of stress than others. These findings suggest that the urban transport planning discourse should move beyond the diatribe on commute times and instead focus on making commutes more predictable rather than shorter.

Martin Turcotte,29 a senior analyst with Statistics Canada, has produced several insightful reports on commutes to work in Canada, relying on two separate waves of the General Social Survey. Mr. Turcotte is one of the first few researchers who pointed out the longer commutes by public transit relative to those by car. His findings about public transit’s tardiness, which most transit enthusiasts hate to acknowledge, have largely been ignored by those active in planning circles and municipal politics. Instead, an ad nauseam campaign against excessive commute times, wrongly attributed to cars, has ensued.

The dissatisfaction with work seems to be playing into the dislike of commuting. Mr. Turcotte used mathematical models to show that even when one controls for trip length and duration, those who disliked their jobs were much more likely to hate their commutes. It appears that our inherent dissatisfaction with present employment is manifesting in more ways than we might acknowledge. The anticipation of working with colleagues one despises or a supervisor one hates is contributing to one’s lack of satisfaction with the daily commute (see Figure 3.5).


Source: Martin Turcotte. Statistics Canada. Catalogue No. 11-088.

Figure 3.5 The interplay between dislike for commuting and paid employment

A key finding of the 2005 survey is that city slickers hated their commutes more than those who lived in small or mid-sized towns. The common sense explanations for why people hate commuting also held: those who commuted for longer durations or over greater distances hated their commutes more than those with shorter commutes did, and those who faced frequent congestion, not necessarily longer commutes, hated their commutes even more.

The mobility dialogue in North America remains hyperbolic. The discourse in Canada’s largest city region of over 6 million, Toronto, is even more animated and exaggerated. The media has caught on to the notion that Toronto has the longest commute times in Canada, while ignoring the fact that Toronto is also the largest labor market in Canada. In fact, the sizes of regional labor markets explain the difference in average commute times across Canada. At the same time, most Torontonians might not know that commuters in other large cities are even more contemptuous of their daily commutes. The survey revealed that Toronto commuters were the least likely of the six large Canadian cities to hate their commutes (see Figure 3.6).


Source: Martin Turcotte. Statistics Canada. Catalogue No. 11-088.

Figure 3.6 Who hates commuting the most?

These findings should help contextualize transport planning discourse in Canada. Our focus should be on what we can change and improve rather than on what is beyond our abilities to influence. Transport planners cannot improve the employment satisfaction for millions of commuters, nor can they devise an urban transit system that will deliver shorter average travel times than cars. Therefore, the focus should be on making commutes bearable and predictable. This might cost much less than the expensive interventions we hear being proposed.

A transport network performance system that informs commuters in real time on their smartphones about accidents, breakdowns, congestion, and delays can be operationalized at a fraction of the cost of building new freeways or subways. Consider that in the United States, smartphones already receive real-time Amber alerts about possible kidnappings of minors based on their location on the transport network. Extend the system to include real-time alerts for the entire urban transport network, delivered to commuters based on their current location, and commuters will be able to plan and execute their trips better, reducing their dissatisfaction with commuting.

Communication and information technology to improve commuting has been in existence for years. What has been missing is imagination and the drive to achieve the possible.

“Are Torontonians Spending Too Much Time Commuting?”

Toronto's long commute times have become a constant refrain dominating the public discourse. Many believe that the commute times are excessive. However, if the laws of physics and common sense prevail, Toronto's 33-minute one-way commute makes perfect sense.

Commuting and congestion in Toronto have long been a source of concern and debate. As early as July 1948, the Globe and Mail published stories of gridlocked downtown Toronto, calling it the "suffering acres." The recent discourse, however, is prejudiced against Toronto's average commute times, which are the highest in Canada. Remedial measures to shorten Toronto's long commutes are being proposed, while commuters (mostly drivers) and transport planners are being criticized for creating rather than solving the problem.

The simple fact that commute times increase with the size of the underlying labor force has been ignored altogether in the ongoing debate. Also overlooked is the fact that Toronto's long average commute times are partially due to the slower commutes of many public transit riders. In fact, the size of the labor force, transit ridership, and the presence of satellite commuter towns (for example, Oshawa) explain Toronto's long commute times (see Figure 3.7). The focus should not necessarily be on reducing commute times, but instead on improving the quality of commutes and the reliability of their duration.


Source: National Household Survey, 2011

Figure 3.7 Commute times increase with the labor force size

Toronto Census Metropolitan Area (CMA) is the largest urban center in Canada, with over 3.5 million workers comprising its labor force. These workers and their jobs cannot be contained in a smaller space, such as London, Ontario, which boasts an average commute time of 21 minutes for a small labor force of 268,000 workers. Whereas Toronto's labor force is 13 times larger than London's, its average commute time is only 1.6 times as long. In fact, Toronto's labor force is 54 percent larger than that of Montreal, Canada's second largest employment hub, yet commute times in Toronto are merely 10 percent longer than those in Montreal.

Toronto's success with public transit is another reason for its long commute times. Almost one in four work trips in the Toronto CMA is made by public transit. The 2011 National Household Survey revealed that trips made by public transit were 81 percent longer than those made by private car. This means that one in four trips in Toronto is made using a slower mode of travel (see Figure 3.8). Given our stated planning priority of moving more commuters out of cars and onto public transit, average commute times are likely to increase, not decrease.


Source: National Household Survey, 2011.

Figure 3.8 Commute times increase with transit use

But how bad is it to have an average commute time of 33 minutes? Is Toronto's average commute time anomalously longer than that of other comparable metropolitan areas in Canada? The average commute times of 30 minutes in Montreal and 28 minutes in Vancouver, cities with significantly smaller labor forces, suggest that Toronto's commute time is certainly not an anomaly or an outlier.

In fact, simple statistical techniques applied to the commute time data for the 33 large CMAs in Canada revealed that the size of the labor force, the presence of satellite towns in the catchment of the large urban centers, and the public transit share explained 85 percent of the variance in commuting times across Canada. The analysis revealed that average commute times increased by a minute for every 2.7 percent increase in public transit ridership, after I controlled for the size of the labor force and the presence of satellite towns (see Figure 3.9).


Figure 3.9 Commute times are explained by labor force size and transit market share
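
The kind of model behind these numbers can be sketched in a few lines of R. The sketch below assumes a hypothetical data frame named cma, with one row per metropolitan area and illustrative column names; it is not the exact specification I estimated, but it shows the general form of regressing commute times on labor force size, transit share, and a satellite-town indicator.

# cma: hypothetical data frame with columns commute_time (minutes),
# labour_force (workers), transit_share (percent of work trips by transit),
# and satellite (1 if the CMA has satellite commuter towns, 0 otherwise)
fit <- lm(commute_time ~ labour_force + transit_share + satellite, data = cma)

summary(fit)   # the R-squared reports the share of variance explained
coef(fit)      # the transit_share coefficient: minutes added per point of ridership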

Joel Garreau, in his seminal text Edge City,30 observed that throughout human history, and irrespective of transport technology, the longest desirable commute has been no more than 45 minutes. Neither Toronto nor any other city in Canada has reached that critical threshold to become a historical anomaly. The fact that average commute times have approached 30 minutes in large urban centres in Canada is merely an artifact of the size of their labor markets.

The debate in Canada should focus not on reducing commute times in large urban centers, which will be akin to setting an impossible goal. Instead, the focus should be on improving the quality of commutes, and more importantly, on making commute times more predictable. It is not the duration of commutes that stresses commuters more, but rather the unpredictability of their commute quality and duration.

“Will the Roof Collapse on Canada’s Housing Market?”

The word on the (Bay) street, and in the news media, is that of an overdue “correction” in Canada’s housing markets. Some analysts sound more alarming than others do and liken Canada’s housing market to that in Ireland. Others sound more calming and speak of a “soft landing.”

Households and investors worry: if the concerns about an inflated housing market are true, will it lead to a drastic collapse? Is the roof about to cave in?

A look at the long-term trends suggests that Canada might experience a "correction." However, a housing market collapse similar to the one experienced by Ireland or the United States is unlikely to occur. Furthermore, Canada's housing markets are largely swayed by a small number of relatively large markets in Greater Vancouver, Calgary, and Toronto. If there were to be a correction, it would most likely be confined to those urban markets that have experienced above-average gains in the past few years.

In a recent comment in the Globe and Mail,31 Tara Perkins compares Canada's housing market to that of Ireland. A line in a 2007 report about Ireland's housing market gave her the chills. It read: "Most available evidence would now appear to suggest that the housing market appears to be on the way to achieving a soft landing." The real estate collapse in Ireland concerns Ms. Perkins because she hears similar muted warnings of a soft landing for Canada.

The warnings might be similar, but that is where the similarities end. A look at the long-term trends in house price appreciation between the two housing markets reveals how different they are as one compares the Consumer Price Index (CPI) for housing for Canada and Ireland (see Figure 3.10). One simply cannot miss the spike in shelter costs in Ireland that began in early 2006 and lasted until the fall of 2008, returning a record 61 percent increase over a 34-month period. In comparison, the Canadian CPI for housing does not depict any sudden spikes, and instead suggests a gradual appreciation in shelter costs.


Source: Federal Reserve Bank of St. Louis

Figure 3.10 Housing price trends in Canada and Ireland

In the eight months following October 2008, the CPI for shelter in Ireland dropped from 143.4 to 93.3 (a 35 percent decline), causing havoc in the housing and other markets. Even a long-term review of Canadian housing markets does not reveal either a sudden exuberance-driven spike or a hangover-like crash in property markets of the kind seen in Ireland.

The concern about Canadian housing markets is largely driven by the higher rates of house price appreciation in Canada's large urban markets, such as Toronto, Vancouver, and Calgary. In the early eighties, average housing prices in these local markets were similar in magnitude to the overall Canadian average (see Figure 3.11). Afterward, however, local markets started to experience greater price appreciation than the Canadian average. Toronto and Vancouver took off in the mid-eighties, while Calgary broke away from the national average in 2005. Whereas Toronto experienced sluggish housing price appreciation in the nineties, Vancouver continued to appreciate at faster rates, picking up even more steam starting in 2002.


Source: Canadian Real Estate Association

Figure 3.11 Housing price trends in Canadian cities

Another important factor to consider, also highlighted by Ms. Perkins in the Globe, is the record low mortgage rates that facilitate borrowing larger amounts for real estate acquisitions. Figure 3.12 confirms this: since the early eighties, mortgage rates have gradually declined while shelter costs have steadily increased.


Source: CANSIM

Figure 3.12 Shelter costs and mortgage rates in Canada

A sudden increase in mortgage rates could prevent a large number of households from servicing their mortgages. At the same time, we see an unprecedented increase in the household debt-to-GDP ratio for Canada, suggesting that Canadian households might have overborrowed and would feel the pinch if debt servicing becomes expensive. These are serious and legitimate concerns.

The other concern is how much of the household debt is made up of mortgage debt. If housing prices collapse, as some economists have warned, what will that do to loan-to-value ratios and default rates in Canada?

Given that we expect the population in Canada’s large urban centres to continue to increase by millions, mostly sustained by immigration, we are, to some extent, immune from the drying up of local housing demand. This suggests that as long as urban Canada is the preferred destination of immigrants from South and Southeast Asia, workers and capital will continue to flow to fuel Canadian housing markets. This is, however, more relevant to the markets in the short run.

In the long run, we will all be dead, and that is all we know for certain. In the meanwhile, it never hurts to keep your debt (housing or otherwise) in check.

Human Development in South Asia

The following pieces focus on the human and economic development challenges in South Asia, mostly Pakistan. I begin by favorably reviewing the progress made by Bangladesh in food security since its painful independence from Pakistan in 1971.

“Bangladesh: No Longer the ‘Hungry’ Man of South Asia”

In less than a quarter century, Bangladesh has outperformed Pakistan in reducing hunger and malnourishment. From trailing Pakistan in hunger reduction in 1990, Bangladesh has sped ahead of Pakistan and even India by halving hunger statistics.

The recently released Global Hunger Index (GHI) by the International Food Policy Research Institute reveals that hunger has declined globally since 1990.32 However, South Asia and sub-Saharan Africa remain home to the worst forms of hunger. Estimates by the Food and Agriculture Organization suggest that no fewer than 870 million people go hungry across the globe.

The pejorative reference to the starving, naked Bengalis (Bhookay, Nungay Bengali) is still part of the Pakistani lexicon. West Pakistan's establishment did not think much of Bangladesh (referred to as East Pakistan before the separation) when it broke away after a bloody war that left hundreds of thousands of Bangladeshis and others dead. Fast forward to 2013 and a new picture emerges, in which Pakistan struggles to feed its people while Bangladesh gallops ahead in human development. One wonders why Pakistan, once thought to have so much promise, has become the sick (and hungry) man of South Asia.

The GHI tracks how countries have performed over the past two decades in fighting hunger and disease. The report reveals the early gains made by South Asia in the 1990s in fighting hunger and malnutrition, at a time when sub-Saharan Africa trailed far behind South Asia in human development. Since 2000, however, sub-Saharan Africa has picked up the pace, and by 2013 it performed better on hunger, on average, than the countries of South Asia.

Despite the slow growth in South Asia, Bangladesh is one of the top 10 countries that have made the most progress in reducing hunger since 1990 (see Figure 3.13). The Bangladeshi success with reducing hunger deserves a closer look to determine if this has resulted from sound planning or is merely a result of happenstance. Given that Bangladesh has beaten not just Pakistan, but also India, Nepal, and Sri Lanka in the pace at which it reduced hunger, the success is likely a result of good planning and execution.


Source: International Food and Policy Research Institute. (2013). 2013 Global hunger index: The challenge of hunger: Building resilience to achieve food and nutrition security.

Figure 3.13 Reduction in Global Hunger Index

The GHI is computed by averaging three indicators: the prevalence of undernourishment in the population, the prevalence of underweight children under five years of age, and the under-five mortality rate (see Figure 3.14). The latest data reveal that, compared to Pakistan, Bangladesh has a lower prevalence of undernourishment and a lower under-five mortality rate. However, Pakistan has a slightly lower prevalence of underweight children under five.


Source: International Food and Policy Research Institute. (2013). 2013 Global hunger index: The challenge of hunger: Building resilience to achieve food and nutrition security.

Figure 3.14 Under-five mortality rate comparison
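
Because the index is a simple average, the calculation described above takes only a couple of lines of R. The values below are purely illustrative placeholders, not actual country figures; each indicator enters the 2013 GHI in percentage terms with equal weight.

# Hypothetical indicator values, in percent
undernourished <- 16.8   # prevalence of undernourishment in the population
underweight_u5 <- 31.9   # prevalence of underweight children under five
mortality_u5   <- 7.2    # under-five mortality rate

ghi <- mean(c(undernourished, underweight_u5, mortality_u5))
ghi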

Research in northwestern Bangladesh suggests that the impacts of hunger are rather seasonal.33 During times of low crop yields, food prices rise, reducing access to sufficient nutrition. The researchers found that the most food insecure are the perpetual poor. A combination of government safety nets and the use of micro-credit helps the poor manage their food supply during lean periods.

Apart from safety nets, the improvement in hunger reduction is the result of several commitments and policies. The federal budget in Bangladesh has a dedicated provision for nutrition. Article 15 of the Bangladeshi constitution instructs the State to provide citizens with the necessities of life, including food. The Bangladeshi government in 2012 committed to food security "for all people of the country at all times." The Bangladesh Integrated Nutrition Project (BINP) has been improving nutritional outcomes since 1995. Later, in 2002, the National Nutrition Program was launched. Also part of the effort were the Expanded Programme on Immunisation and vitamin A supplementation.

The preceding are some examples of how strategic planning resulted in faster reduction of hunger in Bangladesh. Despite the earlier-stated successes, 17 percent of the population in Bangladesh (25 million) continue to suffer from hunger. In fact, 41 percent of the under-five children are likely to be stunted and another 16 percent of under-five children are likely to be “wasted.” These numbers show that Bangladesh still has a long way to go in providing food security to its people.

Given that South Asian countries now lag behind sub-Saharan Africa in hunger reduction, it might be prudent for South Asian heads of state to join hands in collaborative efforts to feed the hungry. They might want to learn from the best practices in Bangladesh and elsewhere to protect the food insecure among them.

“Keeping Pakistan’s High Fertility in Check”

While contraceptives do help with family planning, what really helps is preventing women from marrying very young.

A survey in Pakistan revealed that women under 19 years of age at marriage were much more likely to give birth to five or more children than those who were at least 19 years old at marriage. The same survey also revealed that a visit by family planning staff did not have a significant impact on reducing fertility rates. Instead, women who watched family planning commercials on TV were much less likely to have very large families.

Pakistan is the sixth most populous nation in the world, and its people are exposed to disease, violence, and natural disasters, which increase the odds of losing children to accidents or illness. At the same time, many consider the use of contraceptives to be un-Islamic. In addition, the preference for a male offspring is widespread. As a result, women in Pakistan give birth more frequently than women in developed economies do. The immediate task for the State is to ensure that the rate of decline in the fertility rate observed over the past two decades continues. At the same time, the governments in Pakistan should learn from Bangladesh, which has made significant progress in stemming the population tide (see Figure 3.15).


Source: The World Bank (2013)

Figure 3.15 Fertility rates (births per woman)

Getting down to two children per family might seem an elusive target; however, Pakistanis have made huge dents in the alarmingly high fertility rates, despite the widespread opposition to family planning. Since 1988, the fertility rate in Pakistan has declined from 6.2 births per woman to 3.5 in 2009. In a country where the religious and other conservatives oppose all forms of family planning, a decline of 44 percent in fertility rate is nothing short of a miracle.

A recent paper explores the impact of family planning programs in Pakistan.34 The paper uses data from the 2006–07 Pakistan Demographic and Health Survey, which interviewed 10,023 ever-married women between the ages of 15 and 49 years. The survey revealed that only 30 percent of women used contraceptives in Pakistan. The paper in its current draft has several shortcomings, yet it still offers several insights into what contributes to high fertility and highlights the effective strategies to check high fertility in Pakistan.

The survey revealed that the use of contraceptives did not have any significant impact for women who had given birth to six or more children. While 24 percent of women who did not use any contraceptives reported six or more births, 37 percent of those who used contraceptives reported six or more births. At the same time, 27 percent of women who were not visited by the family planning staff reported six or more births compared with 22 percent of women who had a visit with the family planning staff.

Meanwhile, demographic and socio-economic factors reported strong correlation with the fertility outcomes. Women who were at least 19 years old at marriage were much less likely to have four or more births than those who were younger at the time of marriage. Similarly, those who gave birth before they turned 19 were much more likely to have four or more births.

Education also reported strong correlation with fertility outcomes. Consider that 58 percent of illiterate women reported four or more births compared to 21 percent of those who were highly educated. Similarly, 60 percent of the women married to illiterate men reported four or more births compared to 39 percent of the women married to highly educated men. The survey revealed that literacy among women mattered more for reducing fertility rates than literacy among their husbands.

The underlying variable that defines literacy and the prevalence of contraceptives in Pakistan is the economic status of the households. The survey revealed that 32 percent of women from poor households reported six or more births compared to 21 percent of those who were from affluent households.
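
Comparisons such as these come straight out of simple cross-tabulations. Here is a minimal sketch in R, assuming a hypothetical data frame named pdhs with illustrative column names, not the actual survey variable names:

# pdhs: hypothetical data frame with columns wealth ("poor"/"affluent")
# and births6plus (TRUE if the respondent reported six or more births)
tab <- table(pdhs$wealth, pdhs$births6plus)

# Row percentages: the share of women in each wealth group with six or more births
round(100 * prop.table(tab, margin = 1), 1)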

The preceding results suggest that family planning efforts in Pakistan are likely to succeed if the focus is on educating young women. Educated young women are likely to get married later and have fewer children. This is also supported by a comprehensive study by the World Bank in which Andaleeb Alam and others observed that cash transfer programs in Punjab to support female education resulted in a nine percentage point increase in female enrollment.35 At the same time, the authors found that those girls who participated in the program delayed their marriage and had fewer births by the time they turned 19.

“In fact, women in Punjab with middle and high school education have around 1.8 fewer children than those with lower than middle school education by the end of their reproductive life. Simple extrapolations also indicate that the 1.4 year delay in marriage of beneficiaries associated with the program could lead to 0.4 fewer births by the end of their childbearing years.”

The religious fundamentalists in Pakistan will continue to oppose family planning programs. They cannot, however, oppose the education of young women. The results presented here suggest that high fertility rates could be checked effectively by improving young women’s access to education. At the same time, educated mothers are the best resource for raising an educated nation.

The Big Move

Economic migration and the globalization of trade and services are among the defining characteristics of the economic development of the past few decades. Millions of economic migrants left their homelands in pursuit of better economic prospects abroad. Those who landed at Google or other similarly successful IT firms wrote a new chapter in wealth generation. Many others, though, never got an opportunity to prove their skills. In fact, new immigrants defined the face of poverty in many large cities in North America and Europe. The material presented here offers an account of the rewards reaped and the challenges faced by new immigrants in their adopted homelands.

“Pakistani Canadians: Falling Below the Poverty Line”

Pakistan-born immigrants are the new face of poverty in urban Canada. The Canadian census revealed that 44 percent of Pakistan-born immigrants fell below the poverty line, making them the second most poverty-prone group of immigrants in Canada.

While they might project an aura of opulence during their visits back home, their life in Canada is often full of struggle and frustration. Numerous Pakistani-trained engineers, doctors, and PhDs drive cabs or work as security guards in large cities. In fact, one in three taxi drivers in Canada was born in either India or Pakistan.36 Several other immigrant professionals are unemployed, thus becoming a burden on Canadian taxpayers.

The Census income data for 2005 revealed that Pakistan-born immigrants reported the second highest incidence of falling below the low-income cut-off, a proxy for poverty in Canada. In comparison, only 18 percent of India-born immigrants belonged to a low-income economic family. Immigrants born in the United Kingdom, Portugal, Italy, and Germany reported the lowest incidence of poverty in Canada (see Figure 3.16).


Source: 2006 Public Use Microdata File, Statistics Canada

Figure 3.16 Percentage of low-income households by country of origin

Unlike in the Middle East, where Arab governments do not allow the assimilation of migrant workers, the Canadian government and society largely do not create systemic barriers that might limit immigrants' ability to succeed and assimilate. This is not to suggest that immigrants face no hurdles in Canada. They in fact do. For instance, foreign-trained doctors cannot practice medicine without completing further training in Canada. The shorter duration of medical training in Pakistan necessitates the additional certification for doctors. Engineering graduates from Pakistan, however, face no such barrier because the engineering curriculum and the duration of training in Pakistan are similar to those in Canada.

Despite the opportunities (and constraints), Pakistani-Canadians have not prospered as much as immigrants from other countries have. In 2005, wages earned by Pakistan-born immigrants were on average 70 percent of the wages earned by those born in Canada (see Figure 3.17). In comparison, wages earned by India-born immigrants were 86 percent of the wages earned by Canadians. At the same time, immigrants born in America earned 20 percent more in wages than those born in Canada. Similarly, UK-born immigrants also reported higher average wages than the Canadian-born.


Source: 2006 Public Use Microdata File, Statistics Canada

Figure 3.17 Wages earned by immigrants and others

Because of lower wages, the Pakistan-born immigrants reported one of the lowest home-ownership rates. Only 55 percent of Pakistan-born immigrants owned their homes. In comparison, 75 percent of the India-born immigrants owned their homes. At the same time, while only 12 percent of the India- and Philippines-born immigrants had never been part of the workforce, 22 percent of the Pakistan-born immigrants in Canada reported never being in the workforce.

The differences in wages, home-ownership rates, and employment rates between immigrants from India and Pakistan extend beyond the economic sphere. Pakistan-born immigrants live in relatively large families. Whereas only 13 percent of India-born immigrants lived in households of five persons or more, 44 percent of Pakistan-born immigrants lived in households with five or more people.

Given similar cultural endowments, education, and language skills, it is important to explore why Pakistan-born immigrants in Canada have lagged behind their Indian counterparts. The Indian diaspora is much larger and has been established in Canada for a longer period, which has allowed immigrants from India to develop and benefit from the social networks required to establish themselves in new employment markets.

The limited success of (mostly Asian and African) immigrants in the economic sphere and their modest assimilation into mainstream Canadian culture have prompted right-wing groups to launch campaigns against immigration to Canada. While opponents of immigration are mostly naïve and their recommendations to reduce immigration border on lunacy given Canada's demographic deficit, the fact remains that major changes in Canadian immigration policy are already taking place.37 In Saskatchewan,38 for instance, the provincial government in May 2012 changed the law, which now prohibits immigrants from sponsoring their extended family members unless they secure a "high skill" job offer before arrival.

Since 2001, Pakistan has lost the most in its share of supplying immigrants to Canada.39 Pakistan was the third largest source of immigrants to Canada in 2001 supplying 6.1 percent of the total immigrants. However, by 2010 Pakistan’s share of immigrants declined by 71 percent. Pakistan is no longer even in the top 10 sources of immigrants for Canada. At the same time, the Philippines experienced a 153 percent increase in its share of immigrants making it the biggest source of immigrants to Canada in 2010.

While there is no shortage of skilled applicants from Pakistan, it is hard to establish the precise reason for the declining number of emigrants from Pakistan. It could be that the dismal performance of Pakistan-born immigrants prompted the government to reduce the intake from Pakistan. It might also be that the exponential increase in violence and militancy in Pakistan has made the task of verifying credentials difficult.

Over the next 50 years, Canada will need millions more immigrants. The current and future lower fertility rates in Canada suggest that immigration is the only possible way of securing sufficient workers to sustain economic growth. Given the lackluster performance of Pakistani emigrants in Canada, it is unlikely that aspirants from Pakistan will get a chance to benefit from Canada’s demographic deficit.

“Dollars and Sense of American Desis”

American immigrants born in India outdo others in achieving economic success. Pakistan-born immigrants, while trailing behind Indians, also do better than the native-born Americans do.

The estimates reported in the 2010 American Community Survey revealed that the median salaried household income of India-born immigrants was around $94,700. In comparison, the median household income of native-born Americans was estimated at $51,750. Unlike the Pakistan-born immigrants in Canada, who lagged behind others in economic prosperity, Pakistani emigrants in America are relatively better off with their median household incomes 18 percent higher than that of the native-born Americans.

The American Community Survey for 2010 reveals that among the South Asians living in the U.S., India-born immigrants are far ahead of Pakistanis, Bangladeshis, and Afghans (see Figure 3.18). Even when compared with immigrants from Egypt, a country known for supplying highly educated immigrants, Indian emigrants report substantially higher indicators of economic progress.


Source: American Community Survey, 2010

Figure 3.18 Median household income of immigrants and others in the U.S. (2010)

India-born immigrants also reported one of the lowest poverty rates at 4 percent (see Figure 3.19). On the other hand, Afghanistan-born immigrants reported the highest poverty rate: one in five Afghan immigrants was deemed below the poverty line in the U.S. Although Pakistan-born immigrants reported higher median household incomes than native-born Americans did, surprisingly 14 percent of Pakistan-born immigrants were below the poverty line compared to only 9.4 percent of native-born Americans.


Source: American Community Survey, 2010

Figure 3.19 Incidence of poverty among immigrants and others in the U.S. (2010)

Another indicator of financial distress is the percentage of household income spent on gross rent. Households spending 30 percent or more of household income on rent are considered financially distressed. Among households who live in rental units, 57 percent of the immigrants from Pakistan, Bangladesh, and Egypt spent more than 30 percent of the household income on rent compared to only 24 percent of immigrants from India.

These poverty statistics raise several questions. For instance, why do Pakistan-born immigrants, despite a shared South Asian heritage, report a 2.4-times higher rate of poverty than their Indian counterparts do? Furthermore, poverty among younger cohorts (18 years old or younger) is even worse among immigrants from Pakistan than from India. At the same time, almost 50 percent of under-18 Afghan immigrants are reportedly below the poverty line in the U.S. These statistics necessitate exploring the reasons behind the economic disparities among immigrants from South Asia.

Let us undertake a socio-economic comparison of South Asians living in the U.S. I have restricted the reporting to immigrants originating from India, Pakistan, Bangladesh, and Afghanistan. This is done because India, Pakistan, Bangladesh, and to some extent Afghanistan have more in common in culture and recent history than other countries in South Asia. I have thrown in Egypt for good measure to serve as a control for immigrants from another Muslim-majority country with a different cultural background.

The purpose of this comparative review is to determine the reasons behind the success of India-born immigrants in the U.S. Could it be that the immigrants from India had luck on their side, or could it be that Indian immigrants possessed the necessary ingredients to succeed in the highly competitive labor market in the U.S.? More importantly, one needs to explore why immigrants from Pakistan and Bangladesh lag behind those from India in achieving the same levels of economic success.

Sizing the South Asians

With approximately 1.8 million individuals, India-born immigrants form the largest cohort among South Asians in the U.S. The American Community Survey (ACS) in 2010 estimated Pakistan-born immigrants at 300,000, Bangladesh-born immigrants at 153,000, and Afghanistan-born immigrants at 60,000. Egypt-born immigrants totaled 133,000. Immigrants from India were thus approximately six times as numerous as Pakistan-born immigrants. The relatively large size of the Indian immigrant population leads to larger social networks, which are helpful in seeking better employment and social prospects.

Despite their large population base, many India-born immigrants in the U.S. are recent arrivals. Whereas 47 percent of the India-born immigrants arrived in the U.S. after 2000, only 36 percent of the Pakistan-born immigrants arrived after 2000 (see Figure 3.20). This suggests that the economic success of immigrants from India is driven by the recent arrivals. Relatively speaking, immigrants from Afghanistan have enjoyed the longest tenure in the U.S. among the South Asian immigrants. Notice that although 42 percent of Afghan emigrants arrived in the U.S. before 1980, only 25 percent of the Indian emigrants arrived that early.


Source: American Community Survey, 2010

Figure 3.20 Waves of immigration

Pakistanis Have Larger Families

With 4.3 persons per household, immigrants from Pakistan and Afghanistan reported significantly larger family sizes. In comparison, the native-born population reported an average household size of 2.6 persons, whereas the size of India-born immigrant households was around 3.5 persons. The difference between immigrants from India and other South Asians is more pronounced when one looks at per capita earnings. Owing to their smaller household size, immigrants from India reported significantly higher per capita incomes than other immigrants and even native-born Americans. For instance, Bangladesh-born immigrants reported 50 percent lower median per capita incomes than those from India. Moreover, although immigrants from Pakistan reported higher household incomes than immigrants from Egypt, the larger household size of Pakistan-born immigrants brought their per capita incomes below those of the Egyptians.

Larger household size results in overcrowding, especially among low-income households, who often live in rental units. The average household size of rental households from Pakistan was 33 percent larger than that of Indian emigrant renters in the U.S. Fifteen percent of the households from Pakistan were found to have more than one occupant per room on average, compared to only 6 percent of those from India.

Women in the Labor Force

A key source of distinction between the immigrants from India and other South Asians is the higher participation of Indian women in the labor force. A much higher integration of women in the labor force is one of the reasons why immigrants from India have fared much better than others in the United States. Consider that only 42 percent of the women from Pakistan were active in the U.S. labor force compared to 57 percent of women emigrants from India. In fact, women from Pakistan reported one of the lowest labor force participation rates in the U.S., falling behind women from Egypt, Afghanistan, and Bangladesh.

Education Matters the Most

It should come as no surprise that emigrants from India are one of the most educated cohorts in the United States. Almost 42 percent of adult Indian emigrants held a graduate (master’s) or professional degree. In comparison, only 10 percent of native-born adults reported a graduate or professional degree. Approximately 23 percent of adult immigrants from Egypt and Pakistan reported the same.

The correlation between higher educational attainment and higher median household incomes is evident in Figure 3.21. India-born immigrants with professional degrees also reported significantly higher incomes than other immigrants and native-born Americans. Immigrants from Afghanistan, with one of the lowest incidences of professional degrees, reported the lowest median household incomes.


Source: American Community Survey, 2010

Figure 3.21 Correlation between higher incomes and higher education

The gender divide again distinguishes immigrants from India from other immigrants. Whereas 70 percent of India-born female adults held a bachelor’s degree or higher, only 46 percent of adult females born in Pakistan reported the same. At the same time, only 28 percent of native-born female adults in the U.S. reported completing university education.

Better Education, Better Careers

The educational attainment levels among adult immigrants largely determine their career choices. University education resulting in professional or graduate degrees allows immigrants to qualify for well-paying jobs in the U.S. Immigrants from India have been able to use their high-quality education to make inroads into high-paying employment markets. One is therefore hardly surprised to see that, of the adult employed population, 70 percent of immigrants from India work in occupations focused on management, business, science, and arts. In comparison, only 44 percent of immigrants from Pakistan and 33 percent of immigrants from Bangladesh are employed in similar occupations.

What We Have Learned

“Give me your tired, your poor, Your huddled masses yearning to breathe free, The wretched refuse of your teeming shore. Send these, the homeless, tempest-tost to me, I lift my lamp beside the golden door!”

In 1883, Emma Lazarus asked for the tired, the poor, and the wretched refuse. India instead sent her very best to the United States. Instead of the huddled masses, graduates from the Indian Institutes of Technology and Management landed by the thousands on American shores. These immigrants were products of a sophisticated higher education system whose foundations were laid by the first Indian Prime Minister, Pandit Nehru, in the early fifties.

In the rest of South Asia, especially in Pakistan and Bangladesh, education has never been a national priority. The results of such differing priorities are obvious. Graduates from Indian universities are outdoing others in competitive labor markets at home and abroad.

“The Grass Appears Greener to Would-Be Canadian Immigrants”

Canada should have gotten it right by now. A 146-year-old country of immigrants should know how to integrate new immigrants. The recent census data, however, suggest that this is not the case.

While Canadians celebrated the 146th birthday of their country, many recent immigrants had little to celebrate in their adopted homeland, where their unemployment rate was 75 percent higher than that of native-born Canadians.

Details from the 2011 National Household Survey (NHS) on labor outcomes paint a dismal picture for many immigrant groups, especially those considered a visible minority, a term referring to people who are visibly not part of the majority race in a given place. For would-be South Asian emigrants, the grass appears greener in Canada.

The labor force statistics from the NHS reveal the uneven geography of employment outcomes for various ethnic groups. More than one in four working-age Arabs, who migrated to Canada between 2006 and 2011, was unemployed (see Figure 3.22). During the same time-period, one in seven South Asian immigrants was also unemployed.


Source: Statistics Canada. National Household Survey, 2011.

Figure 3.22 Unemployment rates for visible minorities in Canada (2010)

The recent immigrants are most likely to experience adverse labor market outcomes, such as un- or under-employment. This is primarily a result of moving to a new place where one does not have social networks, one is unfamiliar with the system, and one’s professional credentials are either not recognized at all or are not recognized fast enough for one to have a career in one’s chosen field. The result of these limitations is that recent immigrants end up working odd jobs, trying to make ends meet. Eventually, they should be able to address these limitations and improve their employment prospects. For South Asian emigrants, this happens to be the case in Canada.

The unemployment rate of recent immigrants from South Asia, that is, those who arrived between 2006 and 2011, was 14.9 percent in 2011. The same for those who arrived between 2001 and 2005 was lower at 10.9 percent. Similarly, for South Asians who landed in the 90s, the unemployment rate was even lower at 9.2 percent. As for those who arrived in the 80s, the unemployment rate was 6.8 percent. Finally, for those who arrived before 1981, it was 5.9 percent.

These figures offer evidence of the assimilation effect in labor market outcomes for immigrants. The longer immigrants stay in their adopted homeland, the more knowledgeable they become about its rules and customs, and the more likely they are to succeed in the labor market.

Despite the assimilation effect, immigrants classified as visible minorities continue to have higher unemployment rates than non-visible minority migrants. While 5.9 percent of the South Asian emigrants who arrived in Canada before 1981 were unemployed, only 5.1 percent of the non-visible minority immigrants (who arrived before 1981) were unemployed.

According to the NHS, the unemployment rates of immigrants also vary across Canada. The worst employment markets for South Asian emigrants were in Quebec. In Montreal, Quebec’s largest city, the unemployment rate for South Asian emigrants was 14.6 percent. On the other hand, the most favorable employment markets for South Asians were in the oil rich Alberta province. In Edmonton, Alberta’s second most populous city, the unemployment rate for South Asian emigrants was much lower at 5.9 percent in 2011. Similarly, the unemployment rate for Arab emigrants was over 16 percent in Quebec and around 9.5 percent in Alberta.

Education plays a role in securing better employment prospects. Highly educated immigrants with an earned doctorate or a master’s degree had unemployment rates of 5.2 percent and 7.2 percent, respectively. However, the unemployment rates for similarly educated non-immigrants in Canada were significantly lower. The unemployment rate for non-immigrants with an earned doctorate was merely 2.9 percent, suggesting that highly educated immigrants, such as PhDs, had a 79 percent higher unemployment rate than non-immigrants with similar credentials. Even worse, one in 10 recent immigrants who arrived in Canada between 2006 and 2011 with an earned doctorate was unemployed.

Although immigrants are able to improve their lot over time in their adopted homelands, the initial years of struggle are always painful. Moreover, immigrants are seldom able to close the wage gap with the native-born, irrespective of their education and skills.

It is never an easy decision to begin with. However, as professionals chart out plans to migrate to foreign lands, they should know that the grass always appears greener on the other side of the fenced border!

“Bordered Without Doctors”

Despite thousands of distinguished doctors of Pakistani origin practicing in the U.S., Pakistani Americans are among the most deprived of healthcare services. Almost one in four Pakistanis is uninsured and thus enjoys no or limited access to the American healthcare system.

The data from the U.S. Census Bureau reveal that 23 percent of the 409,000 Pakistanis in the U.S. were uninsured in 2009, making Pakistanis the most uninsured cohort of all Asians in the U.S. Although the number of Pakistanis in the U.S. doubled between 2001 and 2010, their per capita incomes continue to trail behind those of Indians and other immigrants from Asian countries, which keeps a large cohort of Pakistanis uninsured in the U.S.

Immigrants from Pakistan are not alone in the healthcare wasteland in the U.S. Also performing poorly in healthcare accessibility are immigrants from Bangladesh, Korea, and Cambodia. India-born immigrants fare much better (see Figure 3.23). The percentage of uninsured Indians in the U.S. is half that of Pakistanis.


Source: A Community of Contrasts: Asian Americans in the United States (2011)

Figure 3.23 Uninsured American immigrants by place of origin

A report by the Asian American Center for Advancing Justice explores the uneven landscape of success and deprivation among South Asians in the U.S.40 Although the American establishment selectively cites Asians as a successful example of the American melting pot, the reality is quite nuanced. Unlike the software engineers from Bangalore, whose economic success in the U.S. enjoys bipartisan acclaim, the plight of Bangladeshis or the Hmong in the U.S., who experience significantly more poverty than other immigrants, has largely been ignored.

Access to affordable healthcare has been a major policy hurdle for the U.S. where, unlike Canada, Australia, and most of Western Europe, universal healthcare does not exist. This implies that individuals have to purchase healthcare at market prices from vendors or purchase health insurance instead. The gainfully employed in the U.S. do not share this concern since their employers insure them and their families. However, for a sizeable low-income cohort of approximately 45 million Americans who are uninsured, the healthcare costs can be financially crippling.

A brief visit to a hospital in the U.S. for a few routine tests and an overnight stay can easily run into tens of thousands of dollars. Many, if not most, of the 45 million low-income uninsured Americans cannot afford even a single night’s stay in the hospital. Thus, the uninsured either forgo medical treatment or consult doctors only when it is necessary, which ultimately results in inferior health outcomes for the uninsured in America than the insured.

The free-market approach to healthcare in the U.S. has created a system whose costs can no longer be borne by society. The wage bill of healthcare professionals, mostly doctors, and the profits demanded by pharmaceutical companies and HMOs have left 45 million people without adequate healthcare in one of the most resource-rich countries in the world.

Realizing the challenges faced by uninsured Americans, President Barack Obama has remained determined to extend the healthcare coverage to the 45 million uninsured Americans. The Obama Health Care Reform promises affordable healthcare for all by regulating the healthcare marketplace.41

One would have hoped to see universal support for universal healthcare in the U.S. What can be so wrong in extending healthcare protection to the 45 million uninsured in the U.S.? The reality, however, is quite different: the Obama Healthcare Reform has become a bone of contention between the Democrats (proponents) and the Republicans, who believe that the U.S. government and businesses cannot afford to pay the healthcare costs of their 45 million low-income fellow citizens. The Republicans challenged the proposed healthcare reforms in the U.S. Supreme Court, which decided in favor of President Obama’s plan and, in the process, diluted it as well. With the remaining judges split evenly (four voted to keep the plan while another four found it unconstitutional), Chief Justice John G. Roberts Jr.’s vote was needed to break the tie. Fortunately for President Obama, the Chief Justice favored the President’s plan.

While Americans have continued to debate and agonize over the merits of their healthcare system, the fact remains that theirs is one of the most expensive healthcare systems, even among the developed economies (see Figure 3.24).42 The U.S. spends 18 percent of its GDP on health, which is 50 percent more than Germany spends. Canadians also enjoy universal healthcare and spend only 11 percent of their GDP on health.


Source: WHO (Data obtained from Guardian)43

Figure 3.24 Health spending levels in developed economies

A large number of Pakistanis are also among the 45 million uninsured in the U.S. It is rather odd to see, on one side, the 10,000-plus influential and vocal Pakistan-born physicians in the U.S., and on the other, the roughly 100,000 Pakistanis with no access to healthcare who are ultimately left at the mercy of charities.

Although the Pakistan-born physicians in the U.S. continue to play a role in lobbying the U.S. government on its foreign policy, they might also want to consider expanding their efforts to extend healthcare to thousands of uninsured Pakistani-Americans who have been bordered without doctors.

SUMMARY

Imagine the joy the dancing daffodils brought to William Wordsworth, one of the greatest poets of the English language. As he walked through the countryside, he spotted golden daffodils dancing in the breeze. He couldn’t resist the splendor. He wrote:

Beside the lake, beneath the trees,

Fluttering and dancing in the breeze.

Continuous as the stars that shine

And twinkle on the Milky Way,

They stretch’d in never-ending line

Along the margin of a bay

Wordsworth even guesses their number, 10,000 flowers. But what is memorable is not the number of flowers, but the way he describes how the daffodils move and bring joy to his heart.

Ten thousand saw I at a glance,

Tossing their heads in sprightly dance.

The poem, “Daffodils,” circa 1804, is my favorite example of mixing the power of narrative with numbers to paint a picture that will forever remain embedded in the reader’s mind. I am not suggesting here that you consider inserting poetry in your reports. Instead, I want you to be open to creative thinking and powerful expression. The narrative detailing numbers need not be dry. As Wordsworth shows us in “Daffodils,” wagering a guess about the number of flowers allows him to suggest that they were plentiful. However, the real story was in the way the breeze and flowers joined in a “sprightly dance.”

In this chapter I have detailed the thesis statement about data science. I do not consider data science to be exclusively about big or small data, algorithms, coding, or engineering. I believe data science is about telling stories with data. The strength of data science lies in the power of the narrative.

There is no dearth of statisticians, computer scientists and engineers, and software developers. In fact, the tech bubble bust a decade earlier revealed that we might have graduated too many computer scientists. Still, the leading management consultants forecast a shortage of millions of workers who can put data to good use. I believe the real scarcity is of those who not only can analyze data, but who can also weave powerful data-driven narratives, illustrating their arguments with expressive graphics and tables.

This chapter focused on the deliverable; that is, the report or presentation that constitutes the final step in analytics. Too often, data science texts concentrate on technical aspects, such as statistics or programming. Seldom have such texts highlighted the need for storytelling with data. This book differs from others on data science by focusing on the deliverable and encouraging the reader to become a storyteller who uses numbers to capture the imagination of her readers.

ENDNOTES

1. Virgil Johnson personality sketch; Role as Mohawk Data Sciences Corp pres discussed. (1969, April 20). The New York Times, p. 3.

2. Bryce, G. R., Gould, R., Notz, W. I., and Peck, R. L. (2001). “Curriculum Guidelines for Bachelor of Science Degrees in Statistical Science.” The American Statistician, 55(1), 7–13.

3. Davenport, T. (2015, January 22). “Why Data Storytelling Is So Important, and Why We’re So Bad at It.” Retrieved September 19, 2015, from http://deloitte.wsj.com/cio/2015/02/10/why-data-storytelling-is-so-important-and-why-were-so-bad-at-it/?mod=wsjrc_hp_deloitte.

4. www.mckinsey.com/

5. www.deloitte.com/

6. tinyurl.com/deloitte-us-2014

7. We illustrate time series plots in Chapter 11.

8. Piketty, T. (2014). Capital in the Twenty-First Century. Translated by Arthur Goldhammer. Harvard University Press. Cambridge, MA.

9. Blau, F. D. and Kahn, L. M. (n.d.). “Gender differences in pay.” The Journal of Economic Perspectives: A Journal of the American Economic Association, 14(4), 75–99.

10. We conducted our searches on January 5, 2015. Your results for the same search might differ because of the ever-expanding corpus of Google Scholar.

11. paperpile.com

12. https://evernote.com/

13. See Chapter 7 for regression models.

14. See Chapter 6 for t-tests.

15. www.theguardian.com/data

16. www.theguardian.com/guardian-masterclasses/digital

17. www.theguardian.com/news/datablog/2014/dec/12/jail-rates-soar-in-states-and-territories-statistics-show

18. www.mckinsey.com/features/big_data

19. zomobo.net/play.php?id=D4FQsYTbLoI

20. www.flixster.com/movie/outsourced/

21. www-01.ibm.com/software/analytics/spss/

22. www.information-age.com/channels/information-management/perspectives-and-trends/2112153/analytics-software-market-hits-and3630-billion.thtml

23. www.ttc.ca/News/2012/June/0613_ryerson_ttc.jsp

24. cran.r-project.org/

25. www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=1&pagewanted=all

26. www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

27. www.youtube.com/watch?v=sT7GzZsf3Hg&feature=channel&list=UL

28. www.statcan.gc.ca/pub/11-008-x/2006004/pdf/9516-eng.pdf

29. www.statcan.gc.ca/pub/11-008-x/2011002/article/11531-eng.pdf

30. www.amazon.ca/Edge-City-Life-New-Frontier/dp/0385424345

31. www.theglobeandmail.com/report-on-business/economy/housing/canada-better-served-learning-from-the-housing-crash-in-ireland/article19815811/

32. www.ifpri.org/publication/2013-global-hunger-index-0

33. www.tandfonline.com/doi/abs/10.1080/00220388.2012.720369#.UmdZd_ljtTo

34. papers.ssrn.com/sol3/papers.cfm?abstract_id=2172547

35. elibrary.worldbank.org/content/workingpaper/10.1596/1813-9450-5669

36. www.theglobeandmail.com/news/opinions/editorials/overqualified-immigrants-really-are-driving-taxis-in-canada/article2429356/

37. www.theglobeandmail.com/news/national/time-to-lead/rethinking-immigration-the-case-for-the-400000-solution/article2421322/

38. www.leaderpost.com/business/Protesters+outside+Regina+Legislature+immigration+rule+changes+betrayal/6626818/story.html

39. www.cic.gc.ca/english/resources/statistics/facts2010/permanent/10.asp

40. www.advancingjustice.org/pdf/Community_of_Contrast.pdf

41. www.standupforhealthcare.org/learn-more/quick-facts/12-reasons-to-support-health-care?gclid=CP7Y1NK__bECFRERNAod9BQAlQ

42. www.guardian.co.uk/news/datablog/2012/jun/30/healthcare-spending-world-country

43. www.guardian.co.uk/news/datablog/2012/jun/30/healthcare-spending-world-country

Chapter 7. Why Tall Parents Don’t Have Even Taller Children

You might have noticed that taller parents often have tall children who are not necessarily taller than their parents, and that’s a good thing. This is not to suggest that children born to tall parents are not taller than the rest. They may well be, but they are not necessarily taller than their own “tall” parents. Why I think this to be a good thing requires a simple mental simulation. Imagine if every successive generation born to tall parents were taller than their parents; in a matter of a couple of millennia, human beings would become uncomfortably tall for their own good, requiring even bigger furniture, cars, and planes.

Sir Francis Galton studied the same question in 1886 and arrived at a statistical technique we today know as regression models. This chapter explores the workings of regression models, which have become the workhorse of statistical analysis. In almost all empirical research, whether in academic or professional fields, the use of regression models, or their variants, is ubiquitous. In medical science, regression models are being used to develop more effective medicines, improve the methods for operations, and optimize resources for small and large hospitals. In the business world, regression models are at the forefront of analyzing consumer behavior, firm productivity, and the competitiveness of public- and private-sector entities.

I would like to introduce regression models by narrating a story about my Master’s thesis. I believe that this story can help explain the utility of regression models.

THE DEPARTMENT OF OBVIOUS CONCLUSIONS

In 1999, I finished my Master’s research on developing hedonic price models for residential real estate properties.1 It took me three years to complete the project, which involved 500,000 real estate transactions. As I was getting ready for the defense, my wife generously offered to drive me to the university. While we were on our way, she asked, “Tell me, what have you found in your research?” I was delighted to be finally asked to explain what I had been up to for the past three years. “Well, I have been studying the determinants of housing prices. I have found that larger homes sell for more than smaller homes,” I told my wife with a triumphant look on my face as I held the draft of the thesis in my hands.

We were approaching the on-ramp for a highway. As soon as I finished the sentence, my wife suddenly pulled the car onto the shoulder and applied the brakes. As the car stopped, she turned to me and said: “I can’t believe that they are giving you a Master’s degree for finding just that. I could have told you that larger homes sell for more than smaller homes.”

At that very moment, I felt like a professor who taught at the department of obvious conclusions. How could I blame her for being shocked that what is commonly known about housing prices would earn me a Master’s degree from a university of high repute?

I asked my wife to resume driving so that I could take the next ten minutes to explain to her the intricacies of my research. She gave me five minutes instead, thinking it might not require even that. I settled for five and spent the next minute collecting my thoughts. I explained to her that my research had not just found a correlation between housing prices and the size of housing units; I had also estimated the magnitude of those relationships. For instance, I found that, all else being equal (a term I explain later in this chapter), an additional washroom adds more to the housing price than an additional bedroom. Stated otherwise, the marginal increase in the price of a house is higher for an additional washroom than for an additional bedroom. I found later that real estate brokers in Toronto indeed appreciated this finding.

I also explained to my wife that proximity to transport infrastructure, such as subways, resulted in higher housing prices. For instance, houses situated closer to subways sold for more than did those situated farther away. However, houses near freeways or highways sold for less than others did. Similarly, I also discovered that proximity to large shopping centers had a nonlinear impact on housing prices. Houses located very close (less than 2.5 km) to a shopping center sold for less than the rest. However, houses located moderately close (between 2.5 km and 5 km) to a shopping center sold for more than did those located farther away. I also found that housing values in Toronto declined with distance from downtown.

As I explained my contributions to the study of housing markets, I noticed that my wife was mildly impressed. The likely reason for her lukewarm reception was that my findings confirmed what we already knew from our everyday experience. However, the real value added by the research rested in quantifying the magnitude of those relationships.

Why Regress?

A whole host of questions could be put to regression analysis. Some examples of questions that regression (hedonic) models could address include:

• How much more can a house sell for with an additional bedroom?

• What is the impact of lot size on housing price?

• Do homes with brick exterior sell for less than homes with stone exterior?

• How much does a finished basement contribute to the price of a housing unit?

• Do houses located near high-voltage power lines sell for more or less than the rest?

In this chapter, I begin by covering the history of regression models and reviewing the initial studies done using this method. I introduce Sir Francis Galton, who studied the correlation between parents’ heights and the heights of their children. His study led to the very term regression to the mean. I briefly mention Carl Friedrich Gauss, the mathematician who devised the least squares method, which has been widely used to estimate simple regression models.

However, I focus in detail on the concept of all else being equal. What sets regression models apart from almost all other statistical methods is that the regression models allow us to control for the impact of other factors while we study and analyze the impact of a particular factor.

Let us illustrate this with an example involving wages, education, and experience. It is safe to assume that additional years of schooling will result in higher wages. However, it may also be true that years of experience influence one’s wage. In a regression model, we include both years of schooling and years of experience as explanatory variables to determine their impact on wages. A regression model enables us to estimate the impact of years of schooling while controlling for years of experience. Similarly, we can determine the impact of years of experience on wages by holding years of education constant.
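
To make the mechanics concrete, here is a minimal sketch in Python using simulated data and the statsmodels package. The variable names (wage, schooling, experience) and all coefficient values are illustrative assumptions, not results from the text.

# A hedged sketch: estimate the effect of schooling on wages while
# controlling for experience. All names and numbers are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 500
schooling = rng.integers(8, 21, n)                 # years of schooling
experience = rng.integers(0, 31, n)                # years of work experience
wage = 5 + 1.2 * schooling + 0.6 * experience + rng.normal(0, 4, n)
df = pd.DataFrame({"wage": wage, "schooling": schooling, "experience": experience})

# With both regressors included, the schooling coefficient is read as the
# effect of one more year of schooling, holding experience constant.
fit = smf.ols("wage ~ schooling + experience", data=df).fit()
print(fit.params)

Dropping experience from the formula would fold part of its effect into the schooling coefficient, which is precisely the omitted-factor problem this section warns against.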

Housing price data plays a big role in this chapter. I use it to illustrate basic regression modeling techniques. Furthermore, I also use housing price data to demonstrate how to interpret the output from regression models.

The housing price data has only 88 observations. Because I have produced this book on data science in the age of big data, you may wonder why I use such a small data set.

As I have said earlier, when it comes to big data, it is not the size that matters, but how you use it. I believe that the small-sized data set of housing prices allows new learners to focus more on how to estimate, and more importantly, how to interpret the output from regression models than on the logistics of dealing with a large data set. My purpose of using the housing data is to illustrate the concept of all else being equal. Because the data set contains the necessary variables needed to illustrate not only regression, but also the main principle behind regression (all else being equal), I am pleased with my choice of the small data set.

Using the housing price data, I want to test the hypothesis that large-sized homes sell for higher prices. I have three proxies for size in the data set, namely the number of bedrooms, the square footage of the lot, and the square footage of the built-up space. I will explore which proxy for housing size is the most significant determinant of housing prices.
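
As a rough illustration of that comparison, the sketch below fits one model containing all three size proxies. The column names (price, bedrooms, lotsize, sqft) and the simulated values are assumptions made for illustration, not the book’s data file.

# A hedged sketch: which size proxy matters most for price?
# Column names and simulated values are placeholders, not the book's data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 88                                   # same number of observations as described above
sqft = rng.normal(2000, 400, n)          # built-up space
lotsize = rng.normal(9000, 2500, n)      # lot size
bedrooms = rng.integers(2, 6, n)         # number of bedrooms
price = 20000 + 90 * sqft + 2 * lotsize + 5000 * bedrooms + rng.normal(0, 25000, n)
housing = pd.DataFrame({"price": price, "sqft": sqft, "lotsize": lotsize, "bedrooms": bedrooms})

# Each coefficient is the marginal effect of that size measure, all else equal;
# the t-values indicate which proxy is the most significant determinant.
fit = smf.ols("price ~ bedrooms + lotsize + sqft", data=housing).fit()
print(fit.params)
print(fit.tvalues)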

I also revisit the discussion on beauty and its socioeconomic outcomes. Recall the earlier discussion in Chapter 4, “Serving Tables,” about how instructors’ looks might influence their teaching evaluations. Simply put, do attractive instructors receive higher teaching evaluations from students? When I last discussed this topic, I conducted a series of tabulations and t-tests to explore bivariate relationships. This chapter takes the discussion to a new level involving multiple variables.

I now ask more nuanced questions. Do an instructor’s appearance and looks affect his or her teaching evaluations even when I hold other factors constant? The discussion in this chapter focuses on those “other factors.” For instance, if I were to hold gender, tenure status, proficiency in the English language, visible minority status, and other attributes constant, would I still observe the impact of instructors’ looks on teaching evaluations? Hence, when I say all else being equal, I imply that I am controlling for other possible factors (in addition to the instructors’ looks) that could influence teaching evaluations.

Lastly, I analyze household expenditure data to determine what influences households’ spending on food and alcohol. This particular example allows one to appreciate how certain attributes affect consumption decisions. For instance, I would like to see whether the presence of children influences households’ spending on food. I also estimate a separate model to determine whether children’s presence in a household influences spending on alcohol.

INTRODUCING REGRESSION MODELS

I started this chapter with the discussion about tall parents often having tall children. That is, “[t]all couples almost always beget tall progeny.”2 Again, if one looks at the average height of individuals across the globe, one sees that populations in certain countries are taller than in others. For instance, the average height for men in North America, especially those of European descent, is higher than the average height of men in East Asia. Such observations are certainly not new. In fact, the link between parents’ height and that of their children resulted in the first manifestation of the regression model. In 1886, Sir Francis Galton published a paper in the Journal of the Anthropological Institute of Great Britain and Ireland on the very same topic. In his paper, which he titled “Regression towards Mediocrity in Hereditary Stature,” he observed the following:

The average regression of the offspring is a constant fraction of their respective mid-parental deviations... So if its parents are each two inches taller than the averages from men and women, on average, the child will be shorter than its parents by some factors.

His research is of great interest because it highlights a profound observation about human genetics. Sir Francis Galton noted that if parents were on average taller than the rest of the population, their children would on average be shorter than their parents. Stated otherwise, children of very tall parents will tend to be shorter, not taller, than their parents.

Imagine for a second if the reverse were true, that is, if the children born to taller parents ended up being taller than their own tall parents. The average height of the population would then increase in every successive generation. A few generations later, human beings would be slightly taller than their ancestors. If this were to continue for a few millennia, we would have much taller human beings than before. Because of this tendency to regress to the mean height, which Sir Francis Galton called regression towards mediocrity, we observe that unusually tall parents give birth to children who grow up to be shorter than their parents, thus keeping the height of the human race in check.
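
A toy simulation can make this tendency visible. The sketch below, written in Python with a made-up mean, standard deviation, and parent-child correlation, draws parent and child heights from the same distribution and shows that children of unusually tall parents are, on average, taller than the population but shorter than their parents.

# A toy simulation of regression toward the mean. The mean, standard
# deviation, and correlation below are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(1)
mean, sd, rho, n = 175.0, 7.0, 0.5, 100_000     # heights in cm

parent = rng.normal(mean, sd, n)
# Child heights share the same mean and sd, with correlation rho to the parent.
child = mean + rho * (parent - mean) + rng.normal(0, sd * np.sqrt(1 - rho**2), n)

tall = parent > mean + 2 * sd                   # unusually tall parents
print(parent[tall].mean())                      # well above the population mean
print(child[tall].mean())                       # above the mean, yet below their parents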

This is not to say that the average height of human beings has remained the same over centuries. In fact, the opposite is true. Over the past 150 years, the average height of humans in the industrialized countries has increased by approximately 10 centimeters (Hadhazy, 2015). The Dutch, however, “stand head and shoulders above all others.” Dutch men and women today are, on average, 19 centimeters taller than their mid-nineteenth-century counterparts.

The study of intergenerational height gains by Sir Francis Galton led to the birth of regression models, which have become the workhorse of modern-day empirical research. While Sir Francis Galton first observed the concept of regression towards the mean, the mathematical machinery behind regression models was devised earlier by a brilliant mathematician, Carl Friedrich Gauss. The German mathematician devised the ordinary least squares (OLS) method in 1795. I discuss the OLS method in detail in the following sections.
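
For readers who like to see the arithmetic, the following sketch computes OLS estimates directly from the normal equations on simulated data. The true intercept and slope (3 and 2) are arbitrary values chosen only so the output can be checked.

# A minimal illustration of ordinary least squares: solve the normal
# equations (X'X) b = X'y for the coefficient vector b.
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0, 10, n)
y = 3.0 + 2.0 * x + rng.normal(0, 1, n)        # true intercept 3, true slope 2

X = np.column_stack([np.ones(n), x])           # design matrix with an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimates
print(beta_hat)                                # approximately [3, 2]

Statistical packages perform essentially this calculation, with more careful numerics, behind their regression routines.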

Both Galton and Gauss have numerous accomplishments to their credit. While Sir Francis Galton is a well-known name in eugenics, his half-cousin, Charles Darwin, is known much more widely. In fact, it is believed that Sir Francis Galton discussed his research with Charles Darwin, who reportedly pushed him toward regression.

Gauss, on the other hand, is a celebrated mathematician. In fact, all texts in mathematics pay tribute to the genius of Gauss. However, one of his most widely used discoveries is not always attributed to him by name: the widely used normal distribution is in fact the Gaussian distribution, which he formulated.

All Else Being Equal

A fundamental concept in understanding regression analysis relates to all else being equal. Specifically, we want to isolate the impact of a particular factor on a behavior or phenomenon of interest while we control for other relevant factors. Consider wages as an example. One can argue that the years of schooling affect wages such that those with more years of schooling are likely to earn higher wages than those with fewer years of schooling. In fact, if one were to plot years of schooling and wages, one would observe a positive correlation between schooling and wages.

However, several other factors could be instrumental in determining one’s wages. For instance, gender could also be a determinant. Regrettably, we see gender discrimination in wages where, for doing the same job, men earn more than women do. At the same time, we may observe that people with similar years of schooling earn different wages because of differences in years of experience. Consider two employees, both with undergraduate degrees in business, with one earning much higher wages than the other. It may turn out that the one with higher wages has far more experience than the other employee.

We may also observe that even when gender and the years of schooling and experience are the same, two individual workers may still earn very different wages. For instance, two women workers, both with a master’s degree and five years of experience in their respective fields, may earn significantly different wages because one may have a graduate degree in arts while the other may have earned an MBA.

The all else being equal property of regression models allows us to isolate the effect of one particular influence by controlling for the impact of others. Earlier, we were interested in determining the impact of years of schooling on wages. I can use a regression model to control for gender, experience, and type of education or specialization to isolate the impact of years of schooling on wages. If I were to consider only two variables in isolation, that is, wages and years of schooling, I would have ignored the impact of other factors that may be significant determinants of wages.

Regression models are therefore the right tool to conduct analysis with multiple variables. Such models allow one to control for a variety of factors on the variable or behavior of interest.

What Questions Can We Pose to Regression Models?

It is important to think of the questions that one can put to regression models. A better way to understand the regression modeling process is to think through the process of answering questions and riddles of interest. If we think systematically through the research question, an appropriate analytical framework will emerge from the process, which we can then subject to a regression model for empirical testing.

Let us first think of the difference in spending habits of men and women. For marketing professionals, several interesting questions may emerge from this line of inquiry. For instance, do women spend more on clothing than men do? One can compute average spending on clothing for men and women to answer this question. One can also think more holistically by entertaining other factors that could be influential in spending on clothing. For instance, one can refine the question by including the marital status to determine whether unmarried women spend more on clothes than married women do. Similarly, along with marriage, the presence of children in the household could also be a factor in spending decisions. One can pose the revised question as follows: Do married women with children spend less on clothing than married women without children do? Alternatively, do married women with children spend less on clothing than unmarried women do? Alternatively, do married women with children spend less on clothing than married men with children do?

Another way of thinking about regression models is to consider the purchase of big-ticket items, such as cars and refrigerators. In 2008, the American economy in particular, and the global economy in general, experienced one of the worst recessions in decades. The economic slowdown of 2008 is commonly referred to as the Great Recession. Related to the Great Recession, one can explore whether households postpone the purchase of expensive items during a recession. One can further refine this question by asking whether low-income households postpone buying expensive goods or big-ticket items during recessions. The reason for this qualifier is that recessions perhaps affect the discretionary spending of low-income households more than they do that of high-income households. Additionally, if we include the presence of young children, we can further refine the question and ask whether households with young children postpone buying expensive goods during recessions more than households without children do. The presence of children would affect discretionary spending because children’s expenses continue regardless of the recession, prompting households to cut back on other discretionary spending.

Another example is that of couples postponing child rearing to later years in their lives. Demographers have observed that the average age of women at the birth of their first child has increased over the past few decades (Mathews and Hamilton, 2009).3 Demographers attribute this change to a large number of women opting for careers, which has delayed household formation, marriage or common-law partnerships, and eventually the birth of children. However, asking this question in isolation would not do it justice. One has to account for other factors. For instance, one must consider whether both individuals in a relationship have demanding careers. In some instances, only one of the two individuals may have a demanding career, which may not delay their decision to have children.

Additionally, recall the discussion from Chapter 4 about instructors’ perceived attractiveness by students and its impact on their teaching evaluations. I presented a series of tabulations that showed how teaching evaluations differed, not just for the looks, but also by gender, visible minority status, fluency in English, and other related attributes of the instructor. A regression model is better suited to isolate the impact of an instructor’s perceived attractiveness on his or her teaching evaluation, while I control for other possible determinants of teaching evaluation including gender, English-language fluency, visible minority status, and the like.

If Suburbs Make Us Fat, They Must Also Make Us Pregnant

One can obtain partial answers to questions by relying on correlation analysis, which I briefly introduced in Chapter 6, “Hypothetically Speaking.” For instance, consider that researchers have found a positive correlation between obesity and suburban living (Vandegrift and Yoked, 2004).4 Similarly, we also observe a positive correlation between fertility and suburban living, which suggests that suburban women are more likely to have children than those who live in or near downtowns.5, 6 Another example is that of poverty and fertility rates. Research has shown that low-income economies report higher fertility rates than high-income economies do.7 Stated otherwise, households in low-income countries are likely to have higher fertility than those in high-income countries. What we must not forget is that these are all examples of correlations between two phenomena observed without any consideration for other factors that might be at play.

As I have stated earlier, one can use correlation analysis to test the relationship between the aforementioned phenomena. However, one is likely to obtain a partial picture from the correlation analysis. Take the example of obesity and suburban living where a statistical test using correlation will confirm the relationship. However, one may not know the impact of age and income that might have confounded the relationship between obesity and suburban living.

Age and affordable housing likely influence the positive correlation between suburban living and higher fertility rates for women. We know from demographic data that in most North American cities younger cohorts are more likely to live in, or near, downtowns. Younger individuals, even when in committed relationships, often wait to have children later in their lives. Thus, younger couples live in smaller housing units because their shelter space requirements are less than those of households with children.

From the above, we find that households with children require more space than those without children. Because suburban real estate is cheaper than real estate in or near downtowns (think Manhattan or Chicago), households with children, or those expecting to have children, relocate to the suburbs in search of cheaper housing. This is partly why we observe a difference in fertility rates for urban and suburban women. As a result, we might find that affordable housing has more to do with the higher fertility observed in suburbs than anything else does.

As individuals become older and have children, they require more shelter space. These households relocate to the suburbs in search of affordable housing. At the same time, as one ages, the metabolism slows down. Additionally, one faces significantly more pressures on one’s time, which leaves less discretionary time. In other words, as individuals become older, their metabolism slows down and they have less time to exercise. If you put all this together, you can see that even when a positive correlation exists between obesity and suburban living, it is explained by other factors: aging, a slower metabolism, and less time available for exercise.

The obesity question informs us that correlation between two variables can offer only a partial answer. For a complete answer, we have to use a regression (or other multivariate) analysis to account for other relevant factors.
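
The point is easy to demonstrate with a small simulation. In the hedged sketch below, age, serving as an invented confounder with invented magnitudes, drives both suburban living and body mass; the raw suburb-obesity association is positive, yet it disappears once age is held constant in a regression.

# A hedged simulation of a confounder. All variable names and magnitudes
# are invented for illustration; suburb has no causal effect on BMI here.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5000
age = rng.uniform(20, 70, n)
suburb = (rng.uniform(0, 1, n) < (age - 20) / 60).astype(int)   # older people more suburban
bmi = 20 + 0.12 * age + rng.normal(0, 2, n)                     # BMI rises with age only

df = pd.DataFrame({"bmi": bmi, "suburb": suburb, "age": age})
print(smf.ols("bmi ~ suburb", data=df).fit().params["suburb"])        # clearly positive
print(smf.ols("bmi ~ suburb + age", data=df).fit().params["suburb"])  # close to zero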

Holding Other Factors Constant

As I pointed out earlier, the strength of regression analysis lies in its ability to isolate the impact of one factor while we control for the impact of others. We call this all else being equal.

To understand this phrase, let us revisit the question about factors affecting women’s spending on clothing. One can think of several factors that could influence spending on clothing. Consider age, for instance: younger women employed in the labor force may spend more on clothing if they are more responsive to the latest fashion trends. At the same time, women with established careers have much more discretionary spending power than women who are just beginning their careers, and hence they may outspend the younger women on clothing. All else being equal, women earning higher incomes are likely to spend more on clothing than others. At the same time, the type of career may also influence women’s spending on clothing. Certain professions require one to maintain a professional look and may require higher spending on clothing and appearance.

Married women might also have access to their spouse’s income, which would further enhance their ability to spend on clothing. Conversely, they might have to support their spouses, which may leave less to spend on their own clothing. In addition, personal taste can also influence spending on clothing. For instance, women who prefer designer brands spend more on clothing.

Thus, if we are interested in determining what influences women’s spending on clothing, we have to account for the outlined factors. I admit that the factors mentioned do not constitute an exhaustive list of influences on clothing-related spending. The purpose is to illustrate the fact that mere correlation between two variables of interest—that is, spending on clothing and gender—may not be sufficient to paint a complete picture. We use regression analysis to overcome the limitations of bivariate analytical techniques, such as the correlation analysis.

All else being equal enables us to isolate one particular factor by holding the influences of others constant. I offer one final example to illustrate this point. You may have heard that houses near public transit facilities (metro or subway stations) sell for more than the rest. However, we may find that, in addition to proximity to subway stations, such houses differ in structural type, quality of construction, and size, all of which affect housing prices. Proximity to public transit may only partially explain the difference in housing prices. All else being equal implies that, by holding other factors constant, we can determine the impact of proximity to a subway station on housing prices.

Do Tall Workers Earn More Than the Rest? Not When All Else Is Equal

I share findings from a study on workers’ height and wages to reinforce all else being equal. Do you believe taller workers earn more than the rest? A quick answer is that they do. Studies of workers’ earnings and their respective heights have revealed that taller workers earn more than their shorter colleagues do. How did these studies reach this conclusion? They used regression models, much the same way Sir Francis Galton did in 1886 to determine the relationship between parents’ heights and those of their children. Similarly, Professor Hamermesh and his undergraduate student Amy Parker also used regression models to conclude that good-looking instructors receive higher teaching evaluations. Maybe students learn more from good-looking instructors!

Schick and Steckel (2015) undertook a systematic study of earnings and height and concluded that “[t]aller workers receive a substantial wage premium.”8 In plain English, they observed that taller individuals earn more than others do. But this is not all they found. They also observed that when they controlled for other factors—that is, all else being equal—they found little if any evidence in support for height’s impact on earnings.

Consider the relationship between earnings and an individual’s height. If we ignore other factors that influence an individual’s earning potential, we will erroneously conclude that the mere correlation between height and earnings implies causation. Schick and Steckel (2015), however, demonstrate that such a conclusion would be false. They identified other cognitive and non-cognitive determinants of earnings in addition to height.

They used a longitudinal data set in which all children born in the week of March 3, 1958 in Britain were tracked over time and were subsequently interviewed at ages 7, 11, 16, 23, 33, and 42. The final sample was much smaller than the starting sample in 1958, primarily because of the difficulty in keeping track of individuals over the long run.

The individuals were tested for scholastic aptitude (math and reading skills) and behavioral traits (extrovert, introvert, restless, and so on) during school years. Later surveys tested the adult respondents for problem-solving skills as well as for other proxies for motivation, such as pessimistic or optimistic outlook on life.

Using regression models, Schick and Steckel (2015) found that in the absence of other controls, height was a statistically significant determinant of earnings, suggesting taller workers earned more than shorter workers. However, the effect of height as an explanatory variable reduced when other controls were introduced in the model. Height remained a statistically significant determinant of earnings even when individual’s experience, region of residence, ethnicity, father’s socio-economic standing, and parents’ involvement and interest in the child’s education were controlled for in the model. But it all changed when workers’ math and reading scores, recorded at age 11, and problem-solving skills, recorded at age 33, were added to the model. Suddenly, with the inclusion of cognitive abilities, height became a statistically insignificant determinant of earnings. When non-cognitive measures—such as emotional stability and extraversion assessments made at ages 11, 16, 23, and 33—were included in the model, height remained a statistically insignificant determinant of earnings. In addition, the magnitude of the coefficient for height decreased, approaching almost zero.
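
The logic of that sequence of models can be mimicked with simulated data. In the sketch below, earnings depend only on a cognitive score, while height is correlated with that score; the names, magnitudes, and the assumed link between height and test scores are inventions for illustration, not Schick and Steckel’s data.

# A hedged sketch of the "coefficient shrinks when controls are added" logic.
# Simulated data only: here wages truly depend on cognition, not height.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 3000
cognitive = rng.normal(0, 1, n)                            # e.g., math and reading scores
height = 175 + 3 * cognitive + rng.normal(0, 6, n)         # assumed correlation, for illustration
log_wage = 2.5 + 0.25 * cognitive + rng.normal(0, 0.4, n)  # no direct height effect

df = pd.DataFrame({"log_wage": log_wage, "height": height, "cognitive": cognitive})
print(smf.ols("log_wage ~ height", data=df).fit().params["height"])              # positive
print(smf.ols("log_wage ~ height + cognitive", data=df).fit().params["height"])  # near zero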

Spuriously Correlated

Even when we find a statistically significant correlation between two variables, it may turn out that the two variables are completely unrelated. Consider the case of ice cream sales and drownings. One may find a statistically significant and positive correlation between drownings and the sale of ice cream. Can one assume that drownings are caused by ice cream sales? Would one then impose restrictions on ice cream sales to reduce deaths by drowning?

The preceding example depicts a spurious correlation between two rather unrelated variables. During the summer, hot weather leads to higher ice cream sales. At the same time, people head to pools, lakes, rivers, and beaches for swimming. As more people swim, the odds of drowning increase. Hence, the positive correlation between ice cream sales and drownings has no causal linkage; both ice cream sales and drownings are influenced by hot weather.
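
A minimal simulated sketch of this confounding effect in Stata appears below. The variable names and the numbers that generate them are hypothetical, chosen only to illustrate how a common driver produces a correlation between two otherwise unrelated outcomes.

* Hypothetical simulation: temperature drives both ice cream sales and
* drownings, so the two outcomes correlate even though neither causes
* the other.
clear
set obs 500
set seed 42
generate temperature = rnormal(25, 5)
generate icecream  = 100 + 8*temperature + rnormal(0, 20)
generate drownings =   2 + 0.3*temperature + rnormal(0, 2)
pwcorr icecream drownings temperature, sig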

While correlation analysis is useful, spurious correlations and the presence of confounding or mitigating factors warn us that correlation is not the same as causation. Hence, one has to undertake a more involved and systematic analysis to determine the relationships between behaviors. Regression analysis is better suited for such analysis.

Another point to remember regarding spurious correlation is that this challenge becomes more pronounced with big data. Very large data sets will, almost by default, show some statistically significant correlations among rather unrelated variables. An example of spurious correlation can be found in Varian (2014), where Google Correlate finds a high correlation between new homes sold in the U.S. and searches for oldies lyrics.

Hal Varian, Google’s chief economist, while talking about Google Trend data, speaks of the emerging challenges posed by large-sized data and spurious correlation. He warns: “The challenge is that there are billions of queries so it is hard to determine exactly which queries are the most predictive for a particular purpose. Google Trends classifies the queries into categories, which helps a little, but even then we have hundreds of categories as possible predictors so that overfitting and spurious correlation are a serious concern.”9

A Step-By-Step Approach to Regression

Before I get to regression modeling, or any other empirical analytical technique, I need to have a well-defined question and a sound theoretical model. Recall the example of wage and gender bias discussed earlier under “All Else Being Equal.” I demonstrated how the research question was refined and a theoretical underpinning of the model was developed by considering factors other than gender that could influence an individual worker’s wage.

In Chapter 3, “The Deliverable,” I presented a detailed discussion of what considerations should precede any empirical analysis. I argued for the following:

1. Develop a well-defined research question.

2. Determine in advance what answers are needed.

3. Review research by others to determine what methodologies have been tested and what answers are already known.

4. Identify what information (data) is needed to answer the research question.

5. Determine what analytical methods, for example, regression, are most suited for the analysis.

6. Determine the format and structure of the expected final deliverable (report, essay, and so on).

I must insist that no amount of statistical sophistication can compensate for a poorly conceived research question or an ill-devised theoretical model. After you have a sound theoretical model, you can adopt the following systematic approach to regression modeling.

Figure 7.1 shows a step-by-step approach to conduct regression analysis. I argue that the regression analysis comes at the end of a process that involves several other significant and intermediary steps. The first step involves having access to data. If you have not collected or acquired the data, the analytic process then begins with data collection.


Figure 7.1 Step-by-step approach to regression analysis

You must first define the population that needs to be analyzed. For instance, if a school board is interested in determining the factors influencing the dropout rates from schools in the district, the population for that project will include all schools in the district. The next step is to sample the population. One can conduct a random sample, a stratified random sample, or other advanced sampling procedures, which are beyond the scope of this text. Refer to Batterham and Greg (2005) for further readings on determining sample sizes for your work.10

After you have identified a representative sample, you must design, test, and execute a survey questionnaire, either in print or online. What follows is collecting responses and properly coding them in a database. After the database is ready, it is subjected to analytics, including regression modeling.

This book assumes you already have access to data. I therefore do not cover material on data collection and survey design. After the data are ready to be analyzed, you proceed with descriptive analysis. In this step, you generate tabulations, cross-tabulations, and charts to summarize the data. The purpose of descriptive analysis is to gain an appreciation for the data set, so that when you estimate models or conduct hypothesis testing, you possess an intuitive sense of what lies in the data.
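
A minimal sketch of these descriptive steps in Stata appears below, assuming the housing variables are named hprice, bdrms, lotsize, and sqrft, as in the Stata examples later in this chapter.

* Descriptive analysis before any modeling: summary statistics,
* a frequency table, and a chart.
summarize hprice bdrms lotsize sqrft
tabulate bdrms
histogram hprice, percent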

Recall the earlier discussion in Chapters 4 and 5 about teaching evaluations. The tabulations in Chapter 4 and graphics in Chapter 5 offer examples of descriptive analysis.

Jumping straight to regression modeling could result in errors that may not become obvious to the analyst. Consider the case where age is coded in years and is recorded as 25, 26, and so on. A common practice is to code missing data as 99. Thus, respondents who truly are 99 years old and those with missing data for age will all be coded 99. If one were to compute the average age, and assuming there is a lot of missing data coded as 99, one would end up with an erroneous estimate of the average age. A histogram of the age variable would have exposed this anomaly during the descriptive analysis by highlighting the unusually large number of respondents aged 99.
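
A minimal sketch of the fix in Stata, assuming the variable is named age and that 99 is indeed the missing-data code:

* The spike at 99 in the histogram flags the miscoded values.
histogram age, discrete percent
* Recode 99 to Stata's system missing value before computing means.
replace age = . if age == 99
summarize age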

The next steps involve hypothesis testing and correlation analysis in a bivariate or univariate environment. I covered this in Chapter 6. Recall that tabulations in Chapter 4 showed a slight difference in the teaching ratings for male and female instructors. We may be interested in determining whether the difference in teaching evaluations was statistically significant on its own; that is, when we do not consider other relevant factors. I demonstrated how one could use a t-test to determine the statistical significance in average teaching evaluations between male and female instructors.
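
A minimal sketch of such a t-test in Stata, assuming the evaluation score and gender indicator are named eval and female, as in the Stata examples later in this chapter:

* Two-sample t-test: do average evaluations differ by instructor gender?
ttest eval, by(female)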

Finally, you estimate a regression model after conducting descriptive analyses, hypothesis testing, and have already developed some insights about the data. Given the ease with which regression models are estimated using off-the-shelf software, the tendency among many data scientists and researchers is to jump straight to the regression, skipping the necessary intermediary steps, such as hypothesis testing and descriptive analysis. Skipping the intermediary steps is akin to practicing bad data science.

You may wonder why I promote hypothesis testing using t-tests or other tests in a chapter on regression analysis. First, I must point out that regression modeling is also a formal way of conducting hypothesis testing. What differentiates regression from other bivariate tests is that regression models can simultaneously analyze more variables than just two or three. The reason I want you to still conduct hypothesis tests, as illustrated in Chapter 6, is to have the necessary ingredients in hand as you prepare the final deliverable—the report or an essay. I explain the reasons in the following paragraph.

Let us assume you are commissioned to analyze potential gender bias in wages at a large organization. You are expected to deliver a report that will document your findings. The report will lay out the problem statement in the introduction. In the descriptive analysis section, you will refer to the wage differences that you may have observed between male and female employees. As part of the natural flow of the report, you will then entertain the question of whether the observed wage difference is statistically significant. If the wage difference between males and females is not statistically significant, you may even decide to conclude your report with no further analysis. But if the wage difference is statistically significant, which a t-test would have shown, you will proceed further with the analysis to explore the determinants of wage differences. A regression model could be the next step. The initial hypothesis testing using t-tests thus provides the justification to proceed further with the analysis.

Learning to Speak Regression

Regression models use specific terms to describe their components. To be proficient in regression analysis, we need to know these terms and be comfortable with their usage. The variable of interest, the one being analyzed, is called the dependent variable, which is usually denoted by Y. Other variables that explain the phenomenon or behavior of interest are called the explanatory variables or the independent variables, which are usually denoted by X. Earlier, I spoke of wage being determined by years of schooling. In this example, wage is the dependent variable or the response variable. Educational attainment, measured in years of schooling, serves as the explanatory variable or the independent variable. Wage is also referred to as the regressand and education as the regressor.

We often explain regression models by describing the relationship between the dependent variable and the explanatory variable(s). For wage and education, I describe the relationship as follows: Wage is a function of education. In simple terms, I am saying that wage is determined by years of schooling. If I were to represent the dependent variable (wage) as y and the explanatory variable (years of schooling) as x, I would say that y is a function of x.

There will be times when a dependent variable, such as wage, is being informed by more than one explanatory variable. In such circumstances, I would say that y is a function of x1 and x2. Mathematically, we use the notation shown in Equation 7.1:

y = f(x1, x2) (Equation 7.1)

Equation 7.1 suggests that the dependent variable y is informed by not one but two variables.

In the early days, regression models often used only a single variable as an explanatory variable. Lack of data and limited computing power made estimating large and complex models difficult or impossible. Because we now experience a data deluge, the abundance of data allows us to use not one or two, but several variables.

In the instance where we use only two explanatory variables x1 and x2, we can represent this as a statistical relationship, as shown in Equation 7.2.

y = β0 + β1x1 + β2x2 + ε (Equation 7.2)

y in Equation 7.2 is the dependent variable, and x1 and x2 are the explanatory variables.

Notice in Equation 7.2 the other entities that need explaining. β1 relates to x1 and it defines the relationship between x1, which is an explanatory variable, and the dependent variable y while I control for the other explanatory variable, x2. Similarly, β2 accounts for the relationship between x2, the second explanatory variable, and y, the dependent variable, while I control for the first explanatory variable, x1.

β0 is the conditional mean or the mean value of y when both x1 and x2 are set to 0. Recall the example of education and wages: If y is wage, x1 is the years of schooling, and x2 is the years of experience, then β0 will represent the average wage of workers with zero years of schooling and no experience.

ε is the error term, which accounts for the residuals or what is not explained by the regression model. Let us assume that I have estimated a regression model using education and experience as the explanatory variables. Using the same model, I forecast the wages of individuals based on the years of schooling and years of experience. I then determine the difference between the actual value and forecasted values for wages. The difference between the two is called the error term or residuals.
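
A minimal sketch of this calculation in Stata; the variable names (wage, educ, exper) are assumptions used only for illustration:

* Estimate the wage model, forecast wages, and compute the residuals
* as the difference between actual and forecasted values.
reg wage educ exper
predict wage_hat, xb
gen residual = wage - wage_hat
summarize residual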

I must mention that the regression model being explained here is often referred to as linear regression. This implies that the regression model will generate a straight (linear) “best-fit” line to approximate the relationship between the dependent and explanatory variables. Other more advanced regression models, not discussed in Chapter 7, are quite capable of capturing the inherently non-linear relationships.

The Math Behind Regression

The relation between variables can be described either as a functional relationship or a statistical relationship. I restrict the conversation here to a bivariate scenario where one variable is the dependent variable (y) and the other is the explanatory variable (x). A functional relationship between two variables is represented as a mathematical formula and is of the form shown in Equation 7.3:

y = f(x) (Equation 7.3)

I would interpret the preceding as y is a function of x, implying that y is explained by x.

Therefore, an example could be y = 2x, which suggests that when x increases by 1, y increases by 2. I plot the relationship in Figure 7.2.


Figure 7.2 Example of a functional relationship

Unlike a functional relationship, a statistical relationship is not a perfect one in which all the observed points lie on the line. Usually, observations in a statistical relationship scatter around the line approximating the relationship, because a statistical relationship only approximates the underlying functional relationship. I illustrate this in Figure 7.3, which shows the observed points scattered around the straight line that approximates the relationship. In this particular case, I have plotted the relationship between housing prices (Y) and the built-up area in square feet (X). Equation 7.4 shows the hypothesized relationship:

price = β0 + β1 × square footage + ε (Equation 7.4)

Figure 7.3 A scatter plot of housing prices and sizes with a regression line

A regression line is very much like the line in Figure 7.3, which captures the statistical relationship between two variables. We can see that larger homes sell for higher prices. The scattering of points around the line represents the variation in the dependent variable that the line does not capture. We can see in Figure 7.3 that most points do not fall exactly on the regression line.

The regression model attempts to minimize the deviation (difference) between each actual observation and its corresponding location on the regression line. I have identified the deviation for one particular observation in Figure 7.3 and labeled it Q. The process of generating a regression line that minimizes the sum of all such squared deviations is called the ordinary least squares method, which I explain in a bit.

You have a choice at this stage in the chapter. You can proceed to the next section to learn about the guts of the regression models. However, if math was not your favorite subject in high school, I offer an alternative and encourage you to skip to the section, “Regression in Action.” For those who like to learn by doing and not be burdened by math, skipping to the aforementioned section is a good choice. Others are welcome to join us to learn about the math behind regression analysis in the following section.

Ordinary Least Squares Method

Let us consider a regression model with one explanatory variable. The dependent variable (y) is the housing price and the explanatory variable is square footage of the built-up area (x). The model is expressed as shown in Equation 7.5:

yi = β0 + β1xi + εi (Equation 7.5)

yi represents the dependent variable for the ith observation; that is, a particular house in the data set. y1 is therefore the price of the first housing unit in the data set. xi stands for the explanatory variable for the ith observation. x1 is therefore the square footage of the built-up area of the first housing unit in the data set. Table 7.1 presents the first 10 observations from a sample data set of housing prices.


Table 7.1 A Sample of 10 Observations from a Dataset of Housing Prices

β1 is the estimated parameter or the slope of the regression line. It represents the relationship between the explanatory variable (x), that is, square footage, and the housing prices. β1 informs us of the impact of a change in square footage on housing prices. β0 is the intercept of the regression line. It is also referred to as the constant. Theoretically speaking, it is the conditional mean of the dependent variable. It is the average value conditional upon the explanatory variables. In the housing example, β0 represents the mean housing price conditional upon square footage of the built-up area. Mathematically, it is the mean price of housing when the built-up area equals 0, which effectively implies that β0 represents the value of an empty lot with zero square footage for built-up space.

εi is the random error term for the ith observation. It is also referred to as the residual and represents the difference between the actual price of a housing unit and the price estimated by the regression model. It is represented as Q in Figure 7.3. The regression model minimizes these residuals, and the error term is assumed to have an expected value of 0. Mathematically, I represent this as E{εi} = 0.

Another property of the regression model requires the error term to have constant variance. This implies that the variance of the error term does not increase or decrease in a systematic manner. This property is called homoscedasticity, which I discuss later in this chapter. Mathematically, constant variance is expressed as σ²{εi} = σ², suggesting that the variance does not change across observations.
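
A hedged sketch of one way to check this assumption in Stata: after estimating a regression, the Breusch-Pagan test (estat hettest) tests the null hypothesis of constant variance. The variable names assume the housing data used later in this chapter.

* Check the constant-variance (homoscedasticity) assumption after a fit.
reg hprice sqrft
estat hettest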

Regression models also require that individual observations are not correlated with one another. In the housing example, I would like the price of one housing unit to be independent of another unit’s price. In a regression model, this assumption implies that the error terms are not correlated. Mathematically, the covariance between εi and εj is expected to be 0; that is, σ{εi, εj} = 0 for all i, j; i ≠ j. In simple terms, the residuals are not correlated.

The model in Equation 7.5 is called a simple regression model (it has only one predictor) that is linear in the parameters (no parameter appears as an exponent or is multiplied or divided by another parameter) and linear in the predictor variable (the explanatory variable appears only to the first power). It is also referred to as a first-order model.

Because the expected value of the random error term is zero (E{εi} = 0), I can derive the model shown in Equation 7.6:

E{yi} = β0 + β1xi (Equation 7.6)

For the housing price example, Equation 7.7 shows the estimated regression equation.

ŷ = 16246 + 203x (Equation 7.7)

The β0 equals $16,246 in Equation 7.7. This is the value of a house with zero square footage of built-up space. Essentially, I am referring to the price of an empty lot. β1 in Equation 7.7 equals $203. This implies that the price per square foot is $203. Thus, the price of a house with 1500 square feet in built space is estimated as follows:

y = 16246 + 203 * 1500 = 320746

Let us now see how the model is derived. We are interested in the deviation of yi from its expected value obtained from the model, as shown in Equation 7.8.

yi − (β0 + β1xi) (Equation 7.8)

We consider the sum of n such squared deviations, where n is the number of observations in the data set. We square the deviations because some deviations will be positive and others negative; squaring guarantees a positive value. Consider Equation 7.9:

Q = Σ(yi − β0 − β1xi)² (Equation 7.9)

The least squares method returns the estimates of β0 and β1, denoted b0 and b1, that minimize Q. In Equation 7.10, x̄ and ȳ are the respective means of x and y:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²;  b0 = ȳ − b1x̄ (Equation 7.10)
Estimating Parameters by Hand

Let us consider a simple example of two variables x and y. I estimate the regression parameters b0 and b1 by hand and present the calculations in Table 7.2. The mean for x is 42.4 and for y is 13.9. The third column in the table presents the difference between xi and the mean value for x. The fourth column does the same for y. The other column headings are self-explanatory.


Table 7.2 Regression Model Calculations for a Simple Example

Thus,

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 0.27835

b0 = 13.9 – .27835 * 42.4 = 2.0980

The estimated regression equation is expressed as shown in Equation 7.11:

ŷ = 2.098 + 0.2784x (Equation 7.11)

ŷ is the value of the dependent variable predicted by the regression model for a given value of x.

Residuals (Error Terms) and Their Properties

The ith residual is the difference between the observed value yi and the corresponding value predicted by the model, ŷi. The estimated value for the first observation in Table 7.3 is expressed as follows:

ŷ1 = 2.098 + 0.2784 × 35 = 11.84


Table 7.3 Calculations for Residuals and Forecasted Values

The residual is the difference between the actual observation and the estimated value, given by:

ε1 = y1 − ŷ1 = 12 − 11.84 = 0.16

Table 7.3 shows the calculations for the rest of the data.

The error term (residuals) obtained from a regression model has the following properties:

• The sum of all residuals equals 0; that is, Σεi = 0.

• The sum of squared residuals, Σεi², is the minimum for the regression line. No other line yields a smaller sum of squared residuals than the regression line.

• The sum of the observed values equals the sum of the fitted values; that is, Σyi = Σŷi.

• The sums of the weighted residuals are zero; that is, Σxiεi = 0 and Σŷiεi = 0.

Is the Model Worth Anything?

The purpose of the regression analysis is not only to find the relationship between two or more variables, but also to determine whether the relationships are statistically significant. The statistical tests are based on the comparison between the predicted values and the observed values in the data set. Put simply, the analysis of the goodness of fit is primarily an analysis of the residuals.

In the end, we would like to say how good our model is. A commonly used proxy for the goodness of fit is the coefficient of determination. The following discussion leads to estimating the coefficient of determination.

Errors Squared

Let me first introduce the term Sum of Squared Errors (SSE), which is the sum of the squared residuals. Mathematically, SSE is represented as shown in Equation 7.12:

SSE = Σ(yi − ŷi)² (Equation 7.12)

The SSE is used to determine the Mean Squared Error (MSE). I divide the SSE by the degrees of freedom (DOF) to determine the MSE. In the preceding example, I have lost two degrees of freedom, one for each estimated parameter (b0 and b1). The MSE is expressed as shown in Equation 7.13:

MSE = SSE / (n − 2) (Equation 7.13)

The model fit can be expressed as the Standard Error of the model, which is calculated by taking the square root of the MSE.

Coefficient of Determination (r²)

The purpose of a statistical model is to capture, or at least approximate, the dynamics of the underlying data. We are interested in explaining the variance in the dependent variable, which is a proxy for how well the model fits the data. The coefficient of determination (r²) is the most commonly used measure of the goodness of fit. It involves two entities: the Sum of Squared Errors (SSE), which I already explained, and the Total Sum of Squares (SSTO), which is the sum of squared differences between the dependent variable and its mean, ȳ. Mathematically, the coefficient of determination is computed as shown in Equation 7.14:

r² = (SSTO − SSE) / SSTO = 1 − SSE / SSTO (Equation 7.14)

The r2 for our worked-out example is computed as follows:

r² = 1 − SSE / SSTO ≈ 0.95

I interpret the goodness of fit for our model as follows: the model explains 95% of the variance in the data. r² is bounded by 0 and 1, with higher values suggesting a better fit.

Numerous texts in statistics mention arbitrary thresholds for r². Some go as far as to suggest that an r² of less than 0.5 indicates a poor fit. Others suggest that r² should be higher than 0.7 for a model to be considered a good fit. It is important for data scientists, especially those who work with behavioral data, to realize that there is no sound foundation for insisting on such arbitrary thresholds. The decision-making processes of human beings are very complex, and it is naïve to expect a statistical model to explain 80 to 90% of the variance in human behavior.

One should always consult existing literature to see examples of model fits. In microeconomics, the model fits often range from 0.1 to 0.5. It is extremely rare to see a model with an r2 of 0.9 in models related to human behavior. At the same time, in engineering studies involving measurements on materials, one finds significantly higher values for r2.

Are the Relationships Statistically Significant?

After estimating a regression model, you would like to know whether the estimated coefficients bear any statistical significance. You need to undertake hypothesis testing to see whether the outcome of the model is an artifact of chance or whether it reflects a sound relationship. I build on the discussion of hypothesis testing covered in Chapter 6; the fundamental concepts remain the same.

Recall Equation 7.7 from the housing price example in which I estimated values for b0 and b1. I found that housing price increases by $203 for a square foot increase in the built-up area and that the land value equaled $16,246. How would we know whether these estimates are statistically significant? Fortunately, we can conduct a t-test to determine the statistical significance of the estimated parameter.

Assume that the true value of an estimated parameter is zero, but we have obtained a different value from the regression model. We set the null hypothesis to state that the true value of the parameter is zero. The alternative hypothesis states that the true value does not equal zero. Equation 7.15 shows this mathematically:

H0: β1 = 0;  Ha: β1 ≠ 0 (Equation 7.15)

If β1 = 0, it implies that there exists no linear relationship between the dependent and the explanatory variables. Let us first estimate the variance (s²{b1}) of the sampling distribution of b1, which is an estimate of β1. Continuing with our worked-out example, we can obtain the standard error of the estimate, s{b1}, as follows:

s²{b1} = MSE / Σ(xi − x̄)²;  s{b1} = √(s²{b1})

Statistical theory informs us that the point estimator s²{b1} is an unbiased estimator of σ²{b1}. The square root of s²{b1} gives us the point estimator of σ{b1}. I can then use the t-statistic to test the hypothesis:

t* = b1 / s{b1}

After I have obtained the t-statistic, I can use the following decision rules to test the hypothesis. You will notice that these tests are similar to the tests I conducted in Chapter 6.

Remember, our null hypothesis states that the estimated coefficient is equal to zero; that is, there is no relationship between the dependent and explanatory variables. I fail to reject the null hypothesis if the calculated value of the test statistic is smaller in absolute value than the critical value obtained from the t-distribution for the relevant degrees of freedom. Equation 7.16 shows this mathematically for a two-tailed test:

If |t*| ≤ t(1 − α/2; n − k), fail to reject H0; otherwise, reject H0 and conclude Ha (Equation 7.16)

α is the level of significance, n is the number of observations, and k is the number of degrees of freedom lost (the number of estimated parameters). For a 5% level of significance, 10 observations, and 2 lost degrees of freedom, the critical value from the t-distribution is

For n = 10; k = 2; α = 0.05, t(1 − α/2; n − k) = t(0.975; 8) = 2.306

Because 12.536 > 2.306, I reject the null hypothesis of no statistically significant relationship between the dependent and the explanatory variable and conclude Ha. Note that I cannot claim to have accepted the alternative hypothesis. The way statistical tests are structured, we can either reject or fail to reject the null hypothesis; we never accept or reject the alternative hypothesis.

For large samples (there is no fixed threshold for how large a sample must be to be considered large), the critical value for a two-tailed t-test at the 5% level of significance approaches 1.96.

Inferences Concerning β0

In statistical analyses, discussing the statistical significance of the constant term (β0) is rather uncommon, primarily because in most applications the constant term does not contribute to the interpretation of the model. Still, I discuss it here so that the topic is covered in full and this chapter can serve as a basic reference.

Equation 7.17 presents the mathematical formulation to conduct statistical tests for significance of the constant term.

t* = b0 / s{b0}, where s²{b0} = MSE [1/n + x̄² / Σ(xi − x̄)²] (Equation 7.17)

For our worked-out example, we have:

Image
Should the Model Receive an F?

Let us now test the hypothesis that the coefficients of all variables in the regression model, except the intercept, are jointly 0; that is, taken together, the variables included in the model have no predictive value. When the probability value of the F-statistic is very small, we reject the null hypothesis that the set of right-hand-side variables has no predictive utility. An F-test is conducted as follows:

F* = [SSR / (p − 1)] / [SSE / (n − p)]

where SSR is the sum of squares due to regression, SSE is the sum of squared errors, and p is the number of estimated parameters (including the intercept). In our worked-out example with only one explanatory variable, we are testing the hypothesis:

H0: β1 = 0

Ha: β1 ≠ 0

In multiple linear regression, when we have more than one explanatory variable, we test the null hypothesis that all estimated coefficients are equal to 0. The decision rule is stated as shown in Equation 7.18:

If F* ≤ F(1 − α; p − 1, n − p), fail to reject H0; otherwise, reject H0 (Equation 7.18)

For n = 10, α = .05, F(.95; 1,8) = 5.32. Because 157.1 > 5.32, I reject the null hypothesis. Again, remember in this particular case, I only had one explanatory variable in the model.

REGRESSION IN ACTION

Let us now test-drive some regression models. Because some readers may have opted to skip through the technical details to see how models are estimated, I believe it will help to repeat the important criteria for model fit. The following items are important in reading the output from a regression model:

• R-squared (r²): The overall fit of the model is determined by the R-squared or the adjusted R-squared. Its value varies between 0 and 1. When comparing models, a higher R-squared suggests a better fit.

• P-value: The p-value is the probability associated with a statistical test. Usually, we use the 95% confidence level as our benchmark. Hence, we conclude a statistically significant finding when the p-value associated with a test is less than 0.05.

• F-test: When the p-value associated with the F-test is less than 0.05, we conclude that, taken together, the explanatory variables in the regression model are collectively different from 0.

• T-statistic: This evaluates the significance of an individual coefficient corresponding to a variable. If the t-statistic for a variable is greater in absolute value than the critical value from the t-distribution (usually 1.96 for a two-tailed test; see Chapter 6 for details), we conclude that there exists a statistically significant relationship between the dependent and explanatory variable. We can also rely on the p-value reported in the regression output for the corresponding t-test: if it is less than 0.05, we conclude a statistically significant relationship between the dependent and explanatory variables.

I begin with a small data set about the determinants of housing sales.

This Just In: Bigger Homes Sell for More

I start with an example of the determinants of housing prices that might establish my bona fides as a professor who teaches in the “department of obvious conclusions.” I am interested in determining the answer to the question: Do large homes sell for more?

The dependent variable is housing prices. The explanatory variables include the number of bedrooms, lot size, square footage of the built-up area, and a categorical variable that controls for the architectural style of the residential unit (see Table 7.4). The data set comprises 88 observations.


Table 7.4 Variables Included in the Housing Dataset

In this chapter, I report the Stata code used to generate the output. The code for other software is available from the book’s website under Chapter 7.

I present basic summary statistics in Table 7.5. The following Stata code highlights the use of a user-written command, outreg2.11 It is a powerful command to generate tabular outputs. Roy Wada is the programmer behind outreg2. You can type findit outreg2 within Stata to install it.

outreg2 using 88h_1.doc , label sum(log)  ///
eqkeep(mean sd min max)  replace


Table 7.5 Descriptive Statistics of Housing Data

The advantage of using outreg2 is that it generates tables directly as Word documents, thus eliminating the need for cumbersome copying and pasting that requires painful formatting adjustments.

Let us have a look at the numbers reported in Table 7.5. The average house price is $425,642 with a standard deviation of $148,934. The average number of bedrooms is 3.6. The average lot size is 9,020 square feet. Right away, we notice something odd about the average lot size. Usually, in the U.S. and Canada, the average lot size varies between 3,000 and 6,000 square feet. The reported average lot size in the data is unusually large. This becomes even more suspicious when I compare the average lot size with the average built-up space, which is 2,014 square feet.

We can learn more about the lot size by plotting a histogram. The histogram in Figure 7.4 reveals that a small number of very large properties (lot sizes over 80,000 square feet) are skewing the data set. We need to make a call here regarding how representative these properties are of the sample. Should we declare these observations outliers and conduct the analysis after excluding them? I leave the answer to this question to you. The following Stata command generates a histogram of lot sizes.

hist lotsize, bin(10) percent


Figure 7.4 Histogram of lot size

Sample Regression Output in Stata

I would like to estimate housing price as a function of bedrooms, lot size, square footage of the built-up area, and an indicator for colonial-style homes. The resulting output is presented in Figure 7.5. The output is similar to that generated by other software, such as R, SAS, and SPSS. The output is organized in three components. The one on the top left presents the analysis of variance and summarizes the values for SSE (6.25 × 10^11), SSTO (1.93 × 10^12), MSE (7.53 × 10^9), and the degrees of freedom (df).


Figure 7.5 Regression model output from Stata software

The component on the top right summarizes the overall goodness-of-fit statistics. It advises us that the data set comprises 88 observations. The F-statistic equals 43.25 and the associated probability is less than 0.0001. Recall the earlier discussion regarding the F-statistic and hypothesis testing using the F distribution (Equation 7.18). The probability of obtaining an F value as high as the one reported in Figure 7.5 (43.25) under the null hypothesis is less than 0.05; therefore, I reject the null hypothesis that the coefficients in the model are collectively equal to 0.

The coefficient of determination (Equation 7.14), or r², is reported as R-squared in Figure 7.5. The goodness of fit of the model (R-squared) is .676, which suggests that the model explains 67.6% of the variance in housing prices. I would consider this an excellent fit for the model. The additional statistic is the adjusted R-squared, which discounts the R-squared for the additional explanatory variables used in the model. The adjusted R-squared is more conservative than the R-squared because it penalizes the model for using additional explanatory variables.
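
For reference, the adjusted R-squared is typically computed as adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k is the number of explanatory variables; as k grows without a corresponding improvement in fit, the adjustment pulls the statistic down.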

The following Stata code regresses the housing price on bedrooms, lot size, square footage of the built-up space, and a binary variable indicating the colonial style of the structure.

reg hprice bdrms lotsize sqrft colonial

The section at the bottom of Figure 7.5 reports the estimated coefficients and tests of their statistical significance. The coefficient for bedrooms (bdrms) is 15,956.22. I illustrated how to estimate the coefficients of a simple regression model in Equation 7.10. This coefficient suggests that each additional bedroom adds $15,956 to the price of the house, all else being equal. Here, all else being equal implies that if the built-up area, lot size, and the style of the house are kept constant, one additional bedroom will contribute $15,956 to the price. The next column reports the standard error for each estimated coefficient, which is used to conduct the test for statistical significance (t-test). I refer you to the earlier discussion about calculating standard errors of coefficients and t-statistics that followed Equation 7.15.

The calculated value for the t-test is obtained by dividing the coefficient by its standard error. For the variable bdrms, I have:

Image

Because the calculated value of the t-test is less than the critical value at the 5% level of significance (that is, 1.99 for the degrees of freedom in this model), I fail to reject the null hypothesis that the coefficient for bdrms is equal to 0. The remaining coefficients are interpreted the same way. I see that the lot size and the built-up area are statistically significant determinants of house price because their calculated t-values are greater than the critical value from the theoretical t-distribution.

Reporting Style Matters

It is rather odd that most books on statistics report regression results formatted the same way as the raw output from software. Academic papers and professional reports, however, seldom report regression results the way they appear in Figure 7.5. The preferred format typically includes the estimated coefficients, their standard errors in parentheses, and asterisks indicating the statistical significance of each parameter. Overall goodness-of-fit measures, such as the R-squared, and the number of observations are also reported at the bottom of the table.

This book reports results formatted the way they should be reported in a publication. Table 7.6 presents the output for the same model discussed earlier, formatted for publication.


Table 7.6 Regression Output Formatted for Publication

Housing Prices: Do Bigger Homes Sell for More?

I use the housing price data to explain the mechanics of regression. Let us begin with a simple model of housing price as a function of the number of bedrooms; I do not include other explanatory variables. Furthermore, I treat the bedroom variable as a continuous variable in the first model and as a categorical variable in the second model. When a variable is introduced in the model as a categorical variable, the model estimates a separate coefficient for each level of the variable except one, which it treats as the base case.

Let us first see how housing prices differ by the number of bedrooms by calculating the mean housing price for each level of the bedroom variable. You can see in Table 7.7 that there are no homes in the data set with one bedroom. The average price for a two-bedroom house is $364,313. It increases to $379,870 for a three-bedroom house. The difference in price between a two-bedroom and a three-bedroom house is $15,557. Note that there are only four units in the data set with two bedrooms and one unit each with six and seven bedrooms.


Arithmetic Mean (SD) [n]

Table 7.7 House Price by Number of Bedrooms
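
A minimal sketch of how a summary like Table 7.7 might be produced in Stata, assuming the price and bedroom variables are named hprice and bdrms as in the regression commands in this chapter:

* Mean, standard deviation, and count of house prices by bedroom count.
tabstat hprice, by(bdrms) statistics(mean sd n)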

Continuous Versus Categorical Explanatory Variables

Now let us estimate two regression models with housing price as the dependent variable and the number of bedrooms treated first as a continuous variable and then as a categorical variable. Results are reported in Table 7.8. The first model treats bedrooms as a continuous variable. It suggests that the price of a housing unit increases by $89,936 for each additional bedroom. There are no other variables in the model; hence, I cannot use the phrase all else being equal. The constant (β0) is $104,735, which is the price of an empty lot, or a house with zero bedrooms. The model explains 25.8% of the variance in housing prices.


Table 7.8 Regression Model of Housing Prices as a Function of Bedrooms

The second model carries significantly more coefficients. Note that no coefficient is reported for two-bedroom houses, which serve as the base case; their value is captured by the constant in the model, which is $364,312. Given that the only other information in the model is the number of bedrooms treated as a categorical variable, the constant represents the base case: the average price of a house with two bedrooms.

Compare the value of the constant in the model labeled 2 in Table 7.8 with the average price of a house with two bedrooms reported in Table 7.7. The numbers are identical except for rounding. Model 2 reports the coefficient for a three-bedroom house to be $15,557. It suggests that, all else being equal, a three-bedroom house will sell for $15,557 more than the base case, which is a two-bedroom house. Again, compare the results with Table 7.7 and you will notice that the price difference between a two- and a three-bedroom house is indeed $15,557. Thus, when no other variable is included in the model, a regression model with a categorical variable effectively reports the group means. In the presence of other explanatory variables, the reported coefficients are conditional upon the other factors in the model being held constant.
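
A hedged sketch of the two specifications in Stata; the factor-variable notation (i.bdrms) is one way to treat bedrooms as categorical, with the omitted level serving as the base case.

* Model 1: bedrooms treated as a continuous variable.
reg hprice bdrms
* Model 2: bedrooms treated as a categorical variable; Stata omits one
* level (the base case) and estimates a coefficient for each other level.
reg hprice i.bdrms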

Notice that Table 7.8 carries additional information in the footnote. It advises us that the table reports standard errors for each coefficient in parentheses. It also provides a key for interpreting the statistical significance of each variable. If a coefficient carries no asterisk, it is not statistically significant at the 90% level. A single asterisk identifies statistical significance at the 90% level. Two asterisks suggest statistical significance at the 95% level; that is, corresponding to a t-statistic of approximately 1.96 or greater. Three asterisks suggest even higher statistical significance.

Building a Housing Model Brick-by-Brick

When we have several explanatory variables, a better approach is to introduce each new variable into the model either individually or as a group. Because of the ease of estimating models using modern statistical software, one may be tempted to simply dump all possible explanatory variables into the model and then cherry-pick a subset of explanatory variables for a revised model. This approach reflects bad data science. The choice of explanatory variables must be dictated by theory and not by the model or parameter fit. For this very reason, I do not cover step-wise modeling approaches in this text; they are essentially tools for those who have no fundamental understanding of the problem they are analyzing, but are willing to accept any answer that a heuristic will generate.

I believe that the price of a house is influenced by its size. I have three proxies for size; namely the number of bedrooms, the built-up area, and the lot size. I also believe that architectural style of the house could be a determinant of its price. With this theory in hand, I estimate five separate models and report them in Table 7.9.


Table 7.9 An Incremental Approach to Building a Model

I first estimate a simple regression model to regress housing prices on the number of bedrooms. The results are reported as Model 1 in Table 7.9. The model estimates that each additional bedroom will fetch an additional $89,936. The coefficient for bedrooms is statistically significant and the model explains 25.8% of the variance in housing prices (R-squared).

In Model 2, I introduce square footage of the lot as an explanatory variable in addition to the number of bedrooms. I know from experience that under normal circumstances, adding a bedroom does not change the square footage of the lot. Hence, I do not expect the coefficient for bedroom to change drastically between Models 1 and 2. The model suggests that each additional square foot in lot size will add $4.14 to the price, if the number of bedrooms is held constant. Similarly, each additional bedroom will add $83,104 to the price, if the lot size remains constant.

Model 3 presents an amazing illustration of all else being equal: I regress housing prices on the number of bedrooms and the square footage of the built-up area of the house. Note that each additional bedroom now fetches far less and the coefficient is statistically insignificant. Each additional bedroom increases the price of the house by $22,037, holding the built-up area constant. However, given that the coefficient for bedrooms is statistically insignificant, I fail to reject the null hypothesis that bedrooms have no impact on housing prices. How is that possible?

The answer to this riddle lies in all else being equal. Our model states that the number of bedrooms has no impact on housing prices when I keep the built-up area constant. If adding a new bedroom does not add more built space, we would effectively have either bisected an existing room or converted another room into a bedroom. In both circumstances, the actual built space of the unit does not change with the addition of a bedroom. When the built-up space does not change, the addition of a bedroom is rather meaningless from the price perspective.

Model 3 suggests that each additional square foot of built space fetches $186.20, holding the number of bedrooms constant. The coefficient for built space is statistically significant. Also note that the R-squared jumps from 25.8% in Model 1 to 63.2% in Model 3, suggesting that the square footage of the built-up space is a very strong predictor.

In Model 4, I reintroduce the square footage of the lot along with the number of bedrooms and the built-up space. Again, I find the number of bedrooms to be statistically insignificant, whereas the other two variables are statistically significant. The model fit improves slightly to 67.2%.

Lastly, I introduce Colonial to the model and find that in the presence of the proxies for size, Colonial returns a statistically insignificant coefficient.

The following Stata code generates the output reported in Table 7.9.

reg hprice bdrms
outreg2 using 88h_4.doc, label replace
reg hprice bdrms lotsize
outreg2 using 88h_4.doc, label append
reg hprice bdrms sqrft
outreg2 using 88h_4.doc, label append
reg hprice bdrms lotsize sqrft
outreg2 using 88h_4.doc, label append
reg hprice bdrms lotsize sqrft colonial
outreg2 using 88h_4.doc, label append

Don’t Throw Out the Baby or the Bath Water

Model 5 in Table 7.9 includes four variables, namely bedrooms, lot size, built space, and Colonial. Bedrooms and Colonial are statistically insignificant, while the other two are statistically significant. Why should we not remove the statistically insignificant variables? After all, the two variables do not contribute to the model fit. Some would argue that we should remove them. I urge strongly against it. If the theoretical model requires a variable to be included, then it should be included, regardless of its statistical significance.

Logs, Levels, Non-Linearities, and Interpretation

The functional form of a model influences its interpretation. Jeffrey Wooldridge, in Introductory Econometrics, mentions four basic combinations of variable types in a regression model and their interpretation.12 Consider that we can introduce a variable either in its raw form (as a level) or as a log transformation. In business and economics, the relationship between two variables is often expressed as an elasticity: that is, the percentage change in one variable with respect to a percentage change in the other. This is easily achieved in regression by (natural) log-transforming the dependent and the explanatory variables before regressing them. The five functional forms, including the four from Wooldridge, are listed in Table 7.10.


Table 7.10 Variable Transformation and Interpretation in Regression Models

These concepts are illustrated in Table 7.11, which presents three model formulations. Model 1 is the elasticity model, in which I have log-transformed both the dependent and the explanatory variables. The estimated coefficient is interpreted as an elasticity; that is, a 1% increase in the built-up space is associated with a 0.87% increase in housing price. The second model uses Colonial as the explanatory variable while the dependent variable is the log-transformed housing price. The estimated coefficient for Colonial homes is 0.118. I multiply the coefficient by 100 to conclude that Colonial-style houses sell for approximately 11.8% more than other houses. Model 3 attempts to capture non-linearities in the data set. The positive coefficient for the squared bedrooms variable suggests that housing prices are higher for very large homes.


Table 7.11 Model Formulations

I would like to explain the need for variable transformation. Consider a wage model where the only explanatory variable is experience. The model suggests that the hourly wage rate increases with experience in a particular industry. Let us assume that the minimum wage is $7 per hour and is earned by those without any experience. Let us also assume that, on average, the wage rate increases by $2 for every additional year of experience. Thus, our model suggests that someone with two years of experience will earn 7 + 2 × 2 = $11 per hour, and someone with 40 years of experience will earn $87 per hour.

I also know that someone with 40 years of experience will be older. If we are speaking of workers in the manufacturing sector, older, experienced workers may not be in very high demand. It may turn out that beyond some point, additional experience does not result in higher wages; wages may taper off or even start declining. This non-linear relationship is common in many sectors and behaviors and can be modeled using a quadratic formulation.

Let us now consider the same wage equation, but this time I account for the fact that wages will not continue to climb with experience. The revised wage (quadratic) equation is as follows.

wage = 7 + 2 × experience − 0.03 × experience²

I plot this relationship in Figure 7.6. I can see that the wage continues to increase with experience for up to 33 years in our hypothesized model (the turning point occurs at 2 / (2 × 0.03) ≈ 33 years of experience). After that, any further increase in experience results in a decline in wages.


Figure 7.6 Non-linear Relationship Between Wage and Experience
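
A minimal sketch to reproduce a plot like Figure 7.6 in Stata, using the hypothesized coefficients from the equation above:

* Plot the quadratic wage-experience profile over a plausible range.
twoway function y = 7 + 2*x - 0.03*x^2, range(0 45) ///
    xtitle("Years of experience") ytitle("Hourly wage")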

Regression models are quite capable of capturing such non-linearities. All we need to do is include the squared term of any variable that we think has a quadratic relationship with the dependent variable. The following Stata code generates the output reported in Table 7.11.

reg lprice lsqrft
outreg2 using 88h_5.doc, label replace
reg lprice colonial
outreg2 using 88h_5.doc, label append
reg lprice bdrms bedsq
outreg2 using 88h_5.doc, label append
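
The log-transformed and squared variables used above (lprice, lsqrft, bedsq) are assumed to already exist in the data set; a minimal sketch of how they could be created, assuming the raw variables are named as in the earlier regressions:

gen lprice = ln(hprice)    // natural log of housing price
gen lsqrft = ln(sqrft)     // natural log of the built-up area
gen bedsq  = bdrms^2       // squared term to capture non-linearity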

Does Beauty Pay? Ask the Students

Mirror mirror on the wall, who is the smartest professor of all?

In this section, I present a systematic analysis of the determinants of teaching evaluation scores recorded at the University of Texas. The primary purpose of the analysis is to determine whether teaching evaluation scores are influenced by an instructor’s looks. My underlying hypothesis is that attractive instructors receive higher teaching evaluations from students, even when I hold other factors, which could influence teaching evaluations, constant.

Table 7.12 lists the variables and their descriptions in the data set. I have worked with the same data set in Chapters 4, 5, and 6. The teaching evaluations are recorded on a scale of 1 to 5, where 5 represents the highest teaching evaluation. I believe that teaching evaluations are a reflection of the instructor’s ability to communicate concepts to students. A successful instructor would therefore require at least two attributes. First, the instructor has to be knowledgeable of the subject. Second, the instructor should possess good communication skills. Unfortunately, these two important determinants of teaching evaluations are not recorded in the data set. I do have other proxies for factors that may influence one’s teaching effectiveness.

Image

Table 7.12 List of Variables in the Teaching Evaluation Data Set

I outline some of my assumptions here. I believe that native English speakers have an advantage in communicating in English over instructors whose first language is not English. Language proficiency, or the lack of it, could therefore serve as a determinant of teaching effectiveness. I also believe that upper-level courses are usually harder in content, and instructors who teach advanced courses may therefore receive lower teaching evaluations, primarily because of the advanced content and the complexity of the subject matter.

Another proxy for advanced courses is the number of credits attached to the course. Single-credit courses are normally offered in the early years and their content is usually of an introductory nature. I hypothesize that instructors teaching introductory courses will receive higher teaching evaluations because of the ease of learning facilitated by the simplicity of the content.

Research-oriented instructors at a university are often tenured, which guarantees them job security so that they may pursue and publish their research without the fear of losing their jobs. Tenured professors are thus senior academics with years of teaching and research experience. Tenure is a fundamental tenet of academic freedom that distinguishes universities from community colleges or schools. One can argue that tenured professors, given their vast teaching experience, are likely to receive higher teaching evaluations. On the other hand, one can also argue that because tenured professors enjoy job security, they may not put as much effort into teaching as untenured instructors do.

I also hypothesize that the visible minority status of an instructor may play a role in their teaching evaluations. In North America, universities have only in the recent past opened their doors to non-Caucasian academics, so students are often not accustomed to visible minority instructors. Faculty in certain disciplines, such as engineering, include a large number of visible minority professors who are often recent immigrants. In other disciplines, such as the humanities and letters, visible minority instructors are not as common.

I hypothesize that visible minority instructors would focus more on their teaching because of their smaller numbers in academia and their desire to prove themselves. Based on this assumption, I believe that visible minority instructors would receive higher teaching evaluations because of their willingness to prove themselves in academia.

Gender has been a source of controversy in academia for a very long time. Women instructors have been paid less than their male counterparts, even when their performance has been similar to, if not better than, that of their male colleagues. Similarly, few women have been appointed to academic leadership roles. Given these systematic differences between men and women instructors, I believe that gender might play some role in teaching evaluations. However, I do not know in advance whether gender should have a positive or a negative influence on teaching evaluations.

Armed with these assumptions and hypotheses, I move ahead to test them using regression analysis. I estimate a series of models, moving from simpler to more complex specifications. I report the model results in Table 7.13.


Table 7.13 Do Teaching Evaluations Depend Upon Instructor’s Looks?

The first model in Table 7.13 regresses teaching evaluations as a function of instructors’ gender. The negative and statistically significant coefficient for female instructors in Model 1 suggests that female instructors receive lower teaching evaluations than their male counterparts do. I cannot use all else being equal here because there is no other explanatory variable used in the model.

The second model introduces two other instructor-specific variables in addition to gender. I notice that the coefficient for minority instructors is negative; however, it is not statistically significant. I also notice a positive and statistically significant coefficient for native speakers of English. The coefficient for female instructors continues to be negative and statistically significant in the second model. This implies that even when I hold minority status and English proficiency constant, female instructors receive lower teaching evaluations than their male counterparts do.

In the third model, I introduce three more variables in addition to the ones reported in Model 2 (Table 7.13). The instructors’ tenure status and upper-division courses return negative coefficients; however, both are statistically insignificant. I find higher teaching evaluations for single-credit courses, which are introductory courses, and the coefficient is statistically significant. The female and minority status of an instructor are both negative and statistically significant predictors in this model, all else being equal.

In Model 4, I introduce instructors’ appearance as an explanatory variable in the model in addition to the variables reported in Model 3. I note that instructors’ looks are positively correlated with the teaching evaluations, all else being equal. The coefficient for instructors’ appearance is statistically significant and positive even when I hold gender, minority status, English proficiency, tenure status, and attributes of courses constant in the model.

I also notice an increase in the model fit as I include more variables to explain teaching evaluations. Model 4 explains 15.6% of the variance in teaching evaluations.

I report here the Stata code and the resulting Table 7.13.

reg eval female
outreg2 using teaching.doc, label replace
reg eval female minority_inst english_speaker
outreg2 using teaching.doc, label append
reg eval female minority_inst english_speaker tenured upper single_credit
outreg2 using teaching.doc, label append
reg eval female minority_inst english_speaker tenured upper single_credit beauty
outreg2 using teaching.doc, label append

Survey Data, Weights, and Independence of Observations

Sample data, such as the teaching evaluations data used here, often require two additional considerations, which must be accounted for in the regression models. First, not every observation in the survey data carries the same influence. Their relative importance, or weight, depends upon the sampling frame, so some observations carry more weight than others.

In addition, not all observations in a sample may be treated as independent. For instance, the same respondent might contribute more than one observation to the data set, and multiple observations from the same respondent cannot be treated as independent. I explain these concerns and how to deal with them in the following paragraphs.

The results reported earlier are subject to certain possible biases. For instance, what if higher teaching evaluations are reported for small-sized classes and vice versa? In addition, it may turn out that a small number of students responding to a teaching evaluation questionnaire in a large class may bias the evaluation scores.

We need to account for the students’ response rate in the model. One way to do this is to use the response rate as weights. Thus, courses in which a higher proportion of students responded to the teaching evaluation questionnaire would receive a higher weight in the model, while courses with lower response rates would carry less weight and would therefore be less able to bias the results.

The other consideration is that teaching evaluations are recorded for courses, not necessarily for instructors. In our data set, several instructors have taught more than one course. Hence, I have multiple observations for several instructors. One can argue that teaching evaluations recorded for courses taught by the same professor cannot be treated as independent observations.

In this section, I account for both biases. Table 7.14 reports three models. Model 1 is the same as Model 4 in Table 7.13. I refer to this as the unweighted model. I have repeated the output to facilitate comparison between the weighted and the unweighted model. Model 2 in Table 7.14 presents the results for the weighted model where I have used response rate as weights.

Image

Table 7.14 Robust Estimates of Teaching Evaluations Weighted by Student Response Rate

A comparison between the weighted and unweighted models reveals that the parameter estimates and standard errors differ between the two models. However, the differences are relatively small in magnitude. More importantly, the direction of the relationships and the statistical significance of the estimated parameters have not changed between the weighted and unweighted models. In addition, the overall model fit, depicted by the R-squared, suggests that both models explain approximately 16% of the variance in teaching evaluations.

Model 3 in Table 7.14 accounts for the limitation that not all observations are independent of each other. Because numerous instructors have taught more than one course, I need to account for such clustering in the data set. A convenient way to address this limitation is to use robust standard errors. The revised model specification affects only the standard errors and not the estimated parameters. If I were not to account for such clustering, non-robust standard errors could be smaller than the robust standard errors, which could erroneously make some coefficients statistically significant. The use of robust standard errors accounts for such false positives.

Notice the output in Model 3 in Table 7.14. I have used student response rates as weights in the model. At the same time, I have used robust standard errors by accounting for the clustered nature of the data set. You will notice that the estimated coefficients do not differ between Models 2 and 3. However, standard errors are different between the two models.

The most profound impact of using robust standard errors is seen for the variable that controls for English proficiency. The coefficient for native English speakers is statistically significant in Model 2. However, using robust standard errors makes this coefficient statistically insignificant in Model 3. Thus, if I were to rely on Model 3, I would say that English proficiency is not a statistically significant predictor of teaching evaluations, all else being equal.

I present here the Stata code and the resulting output in Table 7.14.

reg eval female minority_inst english_speaker tenured upper single_credit beauty
outreg2 using teach_2.doc, label ctitle(un-weighted model) replace
reg eval female minority_inst english_speaker tenured upper single_credit beauty [aw=weight]
outreg2 using teach_2.doc, label ctitle(weighted model) append
reg eval female minority_inst english_speaker tenured upper single_credit beauty [aw=weight], vce(cluster prof)
outreg2 using teach_2.doc, label ctitle(weighted model, clustered standard errors) append

I posed earlier the question of whether instructors’ looks affect their teaching evaluations. I now have the answer: yes, they do. Even when I control for language proficiency, gender, tenure, minority status, and proxies for a course’s rigor, looks still matter. All else being equal, instructors deemed good-looking by their students received higher teaching evaluations than did the rest. I feel doomed but knowledgeable. I now know the reason for not receiving the top teaching evaluations!

You will notice slight differences between the numbers reported in Table 7.14 and the ones in the paper (in Table 3) published by Hamermesh and Parker (2005).13 I have used identical data sets, but have obtained slightly different results. The reason for the difference is how I weighted the model. Hamermesh and Parker used the number of students responding to the teaching evaluation questionnaire as weights. I instead used the percentage of students registered in a course who responded to the teaching survey as weights—hence the difference.

I would like you to change the weighting variable to the number of students who responded to the questionnaire in the weighted model with clustered standard errors. You will notice the resulting model returns exactly the same output as reported by Hamermesh and Parker in Table 3 of their paper.
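As a sketch, and assuming the respondent count is stored in a variable named students (a hypothetical name; substitute whatever the data dictionary calls it), the re-weighted model with clustered standard errors would look like this:

* Sketch: weight by the number of responding students rather than the response rate
* (the variable name "students" is an assumption; use the actual name in the data set)
reg eval female minority_inst english_speaker tenured upper single_credit beauty [aw=students], vce(cluster prof)
outreg2 using teach_2.doc, label ctitle(weighted by respondent count, clustered standard errors) append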

What Determines Household Spending on Alcohol and Food?

Finally, I illustrate regression models using a data set of household spending on food and alcohol. The recipient of the 2015 Nobel Prize in Economics, Professor Angus Deaton, was in fact recognized for his efforts to improve understanding of income and consumption at the individual level. The data set captures the spending of 1,000 Canadian households, which you can download from the book’s website. Table 7.15 lists the variables included with this data set.

Image

Table 7.15 Variables Depicting Household Spending on Food, Alcohol, and Transport

I am interested in determining the answers to the following questions.

• Do households with children spend less on alcohol?

• Do high-income households spend more on alcohol than others do?

• Do high-income households spend more on alcohol, even when they have children?

• What is the impact of an additional child on food expenditure?

Impact of Demographics and Income on Alcohol Expenditure

Let us begin with the basic analysis. The first step is to obtain descriptive statistics. Table 7.16 presents the descriptive statistics. The following is the Stata code to generate Table 7.16.

outreg2 using hhld_1.doc , label sum(log)  ///
eqkeep(mean sd min max)  replace

Image

Table 7.16 Descriptive Statistics on Household Spending on Food, Alcohol, and Transport

The data set contains information on 1,000 households from Canada. The average number of adults is 1.95 persons per household. The minimum number of adults is 1 and the maximum is 3. There are 0.8 children per household with a minimum of 0 and a maximum of 4 children. In total, the average number of persons per household is 2.78 with a minimum of 1 and a maximum of 6 persons.

On average, households spend $30.7 per week on alcohol and $97.3 per week on food. The average weekly household income is $663.

I would like to determine the impact of the presence of children on spending on alcohol. I tabulate the change in weekly alcohol spending for the number of children in the household. I have hypothesized that households with children would have to spend significant amounts on goods and services related to children. This would leave less discretionary income for alcohol consumption.

Table 7.17 suggests that the weekly expenditure on alcohol in fact declines with the increase in the number of children. I also note that a large number of households in the data set (560 out of 1,000) do not have children living with them.
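A tabulation of this kind can be produced in Stata with tabstat; the following is a minimal sketch (the published Table 7.17 may well have been generated with tabout, which is used later for Table 7.20):

* Sketch: mean weekly alcohol spending and the number of households by number of children
tabstat alcoh, by(kids) statistics(mean n) format(%9.2f)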

Image

Table 7.17 Alcohol Spending and Presence of Children

A Theoretical Model of Spending on Alcohol

I assume that weekly spending on alcohol is dependent on the number of adults in the household. I also assume that households with children will spend less on alcohol because such households would have to incur other children-related expenses. I can describe the theoretical model as follows:

If y is the dependent variable, that is, weekly expenditure on alcohol and x1 represents adults, x2 represents children, and x3 represents household income, then the model can be written as:

y = f(x1, x2, x3)

The preceding implies that y is a function of x1, x2, and x3. The regression model is expressed as follows:

y = β0 + β1x1 + β2x2 + β3x3 + ε

The betas in the preceding equation are the regression coefficients that explain the relationship between weekly alcohol expenditure and the explanatory variables. Epsilon is the error term that accounts for what has not been captured by the model.

Table 7.18 presents the results of the regression models estimated using Stata software. I report results for four separate models. Model 1 regresses weekly alcohol spending on the number of adults in the household. Model 2 regresses weekly alcohol spending on adults and households’ weekly income. Model 3 regresses weekly alcohol spending on adults, household’s weekly income, and the number of children in the household. Model 4 is the same as Model 3 except that it is estimated for the subset of households with children.

reg alcoh adults
outreg2 using alcoh-1.doc, label ctitle(alcohol,) replace
reg alcoh adults income
outreg2 using alcoh-1.doc, label ctitle(alcohol,) append
reg alcoh adults income kids
outreg2 using alcoh-1.doc, label ctitle(alcohol,) append
reg alcoh adults income kids if kids>0
outreg2 using alcoh-1.doc, label ctitle(alcohol, households with children) append

Image

Table 7.18 Regression Model of Household Spending on Alcohol

Here is the equation for Model 1, which uses a single explanatory variable, adults.

y = 15.47 + 7.78 * adults + ε

The model suggests that the base weekly expenditure on alcohol is $15.47. Afterward, each additional household member results in $7.78 of additional weekly alcohol spending. The impact of income is captured in Model 2. The revised model is expressed as follows:

y = 14.24 + 4.63 * adults + 0.011 * income + ε

This implies that a dollar increase in weekly income results in a $0.011 increase in weekly alcohol spending, whereas an additional household member results in an additional $4.63 per week in alcohol spending. Notice that the coefficient for each additional adult declines from $7.78 in the model without income to $4.63 in the model with income.

I can see the utility of regression analysis in Model 2 in Table 7.18, which allows us to state that when income is controlled for, each additional household member results in an additional $4.63 in weekly alcohol spending. Without controlling for the impact of income, the estimated impact of each adult on alcohol spending was considerably larger at $7.78. Again, this property of regression models is called all else being equal. I also note that both income and adults return statistically significant coefficients.

Now I add another variable, the number of children, to the model and see the impact of the children on alcohol spending. The estimated model (Model 3 in Table 7.18) is reported as follows:

y = 15.5 + 4.79 * adults + .012 * income – 3.04 * children + ε

The preceding equation suggests that the base weekly expenditure on alcohol is $15.5. Afterwards, each additional household member results in $4.79 of additional alcohol spending. The income variable suggests that each additional dollar results in an additional $0.012 spent on alcohol per week. As for children, I find that each additional child results in $3.04 less in weekly alcohol spending. Hence, for a household with two children, the weekly spending on alcohol declines by $6.08 (2 × 3.04).

Another way of looking at the impact of children is to re-estimate the model only for households with children; that is, I exclude households with no children. The results of the revised model are reported in Model 4 in Table 7.18.

I notice that the coefficient for children is smaller in magnitude (–$1.18) than what I observed in Model 3. In addition, the coefficient is statistically insignificant, so I can no longer conclude that, among households with children, having additional children results in lower spending on alcohol.

It may turn out that there is a non-linear relationship between the number of children and weekly spending on alcohol (see Figure 7.7). I can test this hypothesis by converting the children variable from a continuous variable to a factor (categorical) variable. I do the same for the number of adults and re-estimate the model.

Image

Figure 7.7 Average spending on alcohol decreases with the number of children

The underlying assumption for this approach is that the impact of an increase in the number of adults from 1 to 2 persons differs from the impact of an increase from 2 to 3 persons. Table 7.19 reports the results of the revised model estimated using Stata. Also reported is the Stata code used to generate the results.

reg alcoh income i.adults i.kids
outreg2 using alcoh-2.doc, label ctitle(alcohol) replace

Image

Table 7.19 Revised Formulation for Spending on Alcohol

Notice that a household with two adults spends $3.97 more than one with one adult, whereas a household with three adults spends $10.3 more than the one-adult household does. The latter coefficient is also statistically significant. Also observe that households with one child spend $6.2 less on alcohol than households with no children, all else being equal; this coefficient is statistically significant. Households with three children spend $10.3 less on alcohol than households without children, all else being equal, and again this coefficient is statistically significant. At the same time, the model returns insignificant coefficients for households with two and four children.
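Because each set of dummy variables is compared with a base category (one adult, no children), comparing two non-base categories with each other requires an explicit contrast. A minimal sketch using Stata’s lincom command, run immediately after the model reported in Table 7.19:

* Sketch: does a three-adult household spend significantly more on alcohol than a two-adult household?
lincom 3.adults - 2.adults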

What Influences Household Spending on Food?

I extend the analysis to weekly spending on food. Again, my primary variable of interest is the presence of children in a household and its effect on food spending. I see that average weekly spending on food increases with the number of children in a household (see Table 7.20). Note that I have used tabout,14 a user-written command in Stata, which you need to install before using it.

tabout kids  using table10.doc, ///
c(mean food) f(2c) sum layout(rb) h3(nil)  npos(both) style(tab) append

Image

Table 7.20 Weekly Spending on Food and the Number of Children in a Household

I assume that weekly spending on food depends on the number of adults in the household and that households with children will spend more on food. The estimated model is identical in specification to the one estimated earlier for weekly spending on alcohol.

Table 7.21 presents the results of the regression models estimated using Stata. I report results for four different model specifications. Model 1 regresses weekly food spending on the number of adults in the household. Model 2 regresses weekly food spending on adults and the household’s weekly income. Model 3 regresses weekly food spending on adults, the household’s weekly income, and the number of children in the household. Model 4 is the same as Model 3 except that it is estimated for the subset of households with children.

reg food adults
outreg2 using food-1.doc, label ctitle(food,) replace
reg food adults income
outreg2 using food-1.doc, label ctitle(food,) append
reg food adults income kids
outreg2 using food-1.doc, label ctitle(food,) append
reg food adults income kids if kids>0
outreg2 using food-1.doc, label ctitle(food, households with children) append

Image

Table 7.21 Regression Models to Capture Weekly Spending on Food

Let us write the equation from the preceding output in which we have used only one explanatory variable: adults.

y = 29.4 + 34.73 * adults + ε

The model suggests that the base weekly expenditure on food is $29.4. Afterwards, each additional adult household member results in $34.73 of additional weekly food spending. The impact of income on food spending is captured in Model 2, which is expressed in the following equation:

y = 23.7 + 20.3 * adults + .05 * income + ε

The model suggests that a dollar increase in weekly income results in a $0.05 increase in weekly food spending, whereas an additional adult household member results in an extra $20.3 per week in food spending. Notice that when income is included in the model, the coefficient for adults declines from $34.73 to $20.3. Model 2 states that if we control for income, each additional household member results in an additional $20.3 in weekly food spending. When we control for the impact of income, the magnitude of the impact of each adult on food spending declines by roughly 40%. Both coefficients in the second model are statistically significant.

Model 3 in Table 7.21 reports the results for the specification with three regressors, namely weekly household income, adults, and the children in households. The resulting three coefficients are all statistically significant and positive. The model can be rewritten as follows:

y = 19.1 + 19.76 * adults + .05 * income + 10.86 * kids + ε

Model 3 suggests that the base weekly expenditure on food is $19.1. Afterward, each additional adult results in $19.76 of additional food spending. The income variable suggests that each additional dollar in income results in $0.05 more spent on food per week. As for children, each additional child results in $10.9 more in weekly food spending. Hence, for two children the weekly spending increases by $21.8 (2 × 10.9).

Another way of looking at the impact of children is to re-estimate Model 3 only for households with children; that is, I exclude households without children. The results of the revised model are presented in Model 4. We see a large decline in the model fit (R-squared). However, we do not observe a large difference in the estimated coefficients or their statistical significance.

Lastly, I re-estimate the model by creating a dummy variable for each level of adults and children to isolate their impact on weekly food spending. The underlying assumption here is that the weekly spending on food will increase at a different rate when the number of children increases from 1 to 2 than it would when the number of children increases from 2 to 3. I hypothesize that the increase in food spending does not have a linear relationship with the increase in the number of children in the household. Table 7.22 presents the results estimated in Stata for the revised specification. Also reported is the Stata code used to generate Table 7.22.

reg food income i.adults i.kids
outreg2 using food-2.doc, label ctitle(food) replace

Image

Table 7.22 Revised Formulation with Dummy Variables for Adults and Children

The regression model presented in Table 7.22 suggests that, all else being equal, a household’s food expenditure increases with the number of children. For instance, a household with one child will spend $14.6 more than a household with no children. Similarly, a household with four children will spend $37.07 more than a household with no children.

A comparison between the results presented in Table 7.19 and Table 7.22 is important for understanding how regression models can help us better understand the behaviors we study. Note that the weekly spending on alcohol declines with the increase in the number of children, whereas the weekly spending on food increases with the increase in the number of children.

Even more interesting is the comparison of the overall model fit: the alcohol model explains only 3.5% of the variance in alcohol spending, whereas the food model explains 32.6% of the variance in a household’s weekly food spending. This suggests that the explanatory variables used in the models are better suited to explaining spending on food than spending on alcohol.

ADVANCED TOPICS

Regression models, though very powerful tools, are vulnerable to misspecification and to violations of the assumptions that make them work. Applied statisticians, data scientists, and analysts sometimes forget, or worse, are unaware of the conditions under which regression models can be applied. Any good introductory text in econometrics, such as the ones by Peter Kennedy, Damodar Gujarati, and Jeffrey Wooldridge, offers detailed discussions of the conditions necessary for regression models to work.15

I would like to point readers to two very important considerations that any applied analyst using regression models should know. The first is the assumption of constant variance of the error terms, a condition known as homoskedasticity. The other is a limiting condition known as multicollinearity, which arises when two or more explanatory variables are highly correlated. Ignoring these issues may lead us to draw erroneous conclusions.

Homoskedasticity

Regression models assume that the variance of the error term, conditional upon the explanatory variables, is constant. This assumption of constant variance is called homoskedasticity. The violation of this assumption is called heteroskedasticity, which implies that the variance of the error term changes systematically across various segments of the population.

A good example of heteroskedasticity is to consider a model where I regress food spending on income. Given that higher income is likely to be associated with higher spending on food, it is quite possible that the residuals’ variance may increase in magnitude with income, suggesting heteroskedasticity. Similarly, in a model where I regress savings on income, if the variance of the error term increases with income, heteroskedasticity might be present in the model.
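A simple visual check is to plot the residuals against the fitted values and look for a funnel-shaped spread. The following is a sketch for a basic food-spending model using the household data set introduced earlier:

* Sketch: residual-versus-fitted plot as an informal check for heteroskedasticity
reg food income adults kids
rvfplot, yline(0)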

Heteroskedasticity does not bias the coefficients, but it does affect their standard errors, and hence the inferential statistics. If heteroskedasticity leads to underestimated standard errors, we may erroneously conclude that statistically significant relationships exist where none do. Several tests have been devised to detect the presence of heteroskedasticity, and if it is present, some remedial measures are available, which I discuss in the following paragraphs.

The most commonly used test for heteroskedasticity is the Breusch-Pagan (BP) test. The following are the steps involved in conducting the BP test.

1. Estimate an ordinary least squares regression model.

2. Generate residuals from the model. Square residuals for each observation.

3. With squared residuals as the dependent variable, run an ordinary least squares regression model using the same explanatory variables.

4. Conduct an F-test on the revised model.

5. If the p-value of the F-test is lower than the threshold, we reject the null hypothesis of constant variance or homoskedasticity.

Let us test for heteroskedasticity using the housing price example. I regress the housing price on lot size, square footage of the built-up area, and the number of bedrooms. I generate the residuals from the model and square them. Lastly, I re-estimate the model with the squared residuals as the dependent variable. The results, obtained in Stata, are presented in Table 7.23 (along with the Stata code), which also lists the F-test and the associated p-value. Under Model 2, the p-value for the F-test is 0.002, which leads me to reject the null hypothesis of constant variance and conclude that heteroskedasticity is present.

reg price lotsize sqrft bdrms
outreg2 using bp-1.doc, label ctitle(BP test,original model) ///
adds(F-test, e(F),  Prob>F, e(p), Adj R-sqrd, e(r2_a)) replace
estat hettest

predict _res1, residuals
gen _res2= _res1^2
* plot residuals against fitted values to inspect the spread
* (the x-variable here is an assumption; the original plot may have used a different variable)
predict _yhat, xb
twoway (scatter _res1 _yhat), scheme(s2mono)

reg _res2 lotsize sqrft bdrms
outreg2 using bp-1.doc, label ctitle(BP test,res-squared model) ///
adds(F-test, e(F),  Prob>F, e(p), Adj R-sqrd, e(r2_a)) append

Image

Table 7.23 Results for Breusch-Pagan (BP) Test

Now that I have concluded that heteroskedasticity is present, I need a solution. In many instances, taking the natural log of the variables mitigates the problem. I repeat the analysis by first regressing the log of price on the log-transformed versions of the same three explanatory variables. After estimating the model, I generate the residuals and square them. Lastly, I re-estimate the model with the squared residuals as the dependent variable and the log-transformed versions of the three explanatory variables, and review the F-test to see whether I can reject the null hypothesis.

The results for the revised log-transformed model are presented in Table 7.24. Also listed is the Stata code used to generate the table.

Image

Table 7.24 Log-Transformed Version of the Housing Prices Models

The p-value for the F-test is 0.183, suggesting that I cannot reject the null hypothesis of constant variance; I therefore conclude that the errors are homoskedastic.

reg lprice llotsize lsqrft lnrooms
outreg2 using bp-2.doc, label ctitle(BP test,log-transformed model) ///
adds(F-test, e(F),  Prob>F, e(p), Adj R-sqrd, e(r2_a)) replace
estat hettest

predict _res1a, residuals
gen _res2a= _res1a^2

reg _res2a llotsize lsqrft lnrooms
outreg2 using bp-2.doc, label ctitle(BP test,res-squared model)  ///
adds(F-test, e(F),  Prob>F, e(p), Adj R-sqrd, e(r2_a)) append

Because the presence of heteroskedasticity affects the standard errors associated with the estimated parameters, one can also address it by correcting the standard errors in the model specification. Statistical software, including Stata, offers the option to report “robust” standard errors, which remain valid in the presence of heteroskedasticity. The term “robust” holds a special meaning in statistical analysis: even when heteroskedasticity affects the error variance, the robust estimator returns standard errors that are immune to its impact.

In Stata, one can request robust standard errors by including vce(robust) in the model specification. The formal name for the robust standard error estimator is the Huber/White/sandwich estimator.

I present the same model with regular and robust standard errors in Table 7.25. I note that the coefficients for the two models are identical. However, the standard errors, and hence the statistical significance, are different between the two models. The coefficient for lot size is statistically significant in the model with regular standard errors. However, it is statistically insignificant, and indistinguishable from 0, for the model reporting robust standard errors.

Image

Table 7.25 Comparing Regular Standard Errors with Robust Standard Errors

This implies that had we not used robust standard errors, we would have concluded that lot size is a statistically significant predictor. If I were to base my conclusions strictly on statistical significance, I would have to conclude that lot size is not a strong predictor of housing prices. But I know that not to be the case: lot size matters greatly in determining housing prices. It is for this reason that I recommend including variables that make sense theoretically.

reg hprice lotsize sqrft bdrms
outreg2 using bp-3.doc, label ctitle(dep_var: house price, regular standard errors) ///
adds(F-test, e(F),  Prob>F, e(p), Adj R-sqrd, e(r2_a)) replace

reg hprice lotsize sqrft bdrms, vce(robust)
outreg2 using bp-3.doc, label ctitle(dep_var: house price, robust standard errors) ///
adds(F-test, e(F),  Prob>F, e(p), Adj R-sqrd, e(r2_a)) append

Multicollinearity

Multicollinearity arises when strong correlations exist among some explanatory variables. This happens when more than one explanatory variable in the model measures the same phenomenon. Multicollinearity has puzzled data scientists for years; it is hard to define, and its impact is not straightforward to measure. I mention multicollinearity here because its presence effectively changes the magnitude, and the apparent precision, of the estimated coefficients. In extreme cases, multicollinearity can even reverse the sign of an estimated coefficient or strip the coefficient of its statistical significance.

I illustrate multicollinearity with an example. I regress housing prices on the number of bedrooms, square footage of the built-up area, and lot size in square feet. I enter these variables one-by-one in three regression models. Economic theory tells us that the price of a house is, to a large extent, determined by its size. I have therefore used three proxies for size as regressors. Table 7.26 presents the results of the model estimated in Stata. Also reported with the table is the Stata code used to generate it.

Image

Table 7.26 Regressing Housing Prices to Find Evidence of Multicollinearity

Model 1 in Table 7.26 suggests that each additional bedroom adds $89,936 to the housing value. When I add the built-up area to the model, the number of bedrooms becomes statistically insignificant. Model 2 in Table 7.26 suggests that the number of bedrooms does not affect housing values (that is, the bedrooms coefficient is not statistically significant) when I control for the built-up area. The model fit jumps from 25% (R-squared) to 62.3%. When I add the third variable, lot size, the number of bedrooms remains statistically insignificant.

It appears that in the presence of built-up space, the number of bedrooms does not help explain variations in housing prices. This is rather counterintuitive. Two homes with the same lot size and built-up area can sell for different prices based on the difference in the number of bedrooms. Architectural design and style allow the same built-up space to support a different number of bedrooms. This phenomenon is more pronounced in condominium markets, where units may have the same overall square footage but different numbers of bedrooms depending on the layout of the unit. To argue that unit size alone is sufficient to explain prices is therefore a troubling argument.

Could this be a result of multicollinearity? The number of bedrooms and built-up space are positively and significantly correlated (0.53, p < 0.0001). It is quite possible that the number of bedrooms lost its statistical significance because of this correlation with built-up space.

reg hprice bdrms
outreg2 using multi-1.doc, label ctitle(beds) ///
adds(F-test, e(F),  Prob>F, e(p), Adj R-sqrd, e(r2_a)) replace

reg hprice bdrms sqrft
outreg2 using multi-1.doc, label ctitle(beds + sqrft) ///
adds(F-test, e(F),  Prob>F, e(p), Adj R-sqrd, e(r2_a)) append

reg hprice bdrms sqrft lotsize
outreg2 using multi-1.doc, label ctitle(beds + sqrft +lotsize) ///
adds(F-test, e(F),  Prob>F, e(p), Adj R-sqrd, e(r2_a)) append

pwcorr hprice bdrms sqrft lotsize, sig

Variance Inflation Factors (VIF) are used to test for multicollinearity. If the VIFs are above a certain threshold (for example, 5 or 10), one may consider dropping one of the correlated variables and re-estimating the model. However, I advise strongly against dropping variables from the model just because of multicollinearity. Our discussion about the number of bedrooms and the square footage of a condominium implies that even if the model suggests multicollinearity, we must not drop a statistically insignificant variable from the model, especially when its inclusion is warranted by theory.
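In Stata, VIFs are available from estat vif immediately after estimating the model; a minimal sketch for the housing price model follows:

* Sketch: variance inflation factors for the full housing price model
reg hprice bdrms sqrft lotsize
estat vif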

If multicollinearity is pervasive in the data set, another option is to use factor analysis to group highly correlated variables together and run the regression model using the resulting factors as regressors.
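As a sketch of this approach, Stata’s pca command could collapse the three size proxies into a single component, which then enters the regression in place of the individual variables (the component name below is arbitrary):

* Sketch: combine the correlated size proxies into one principal component and use it as a regressor
pca bdrms sqrft lotsize
predict size_pc1, score
reg hprice size_pc1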

SUMMARY

This chapter started by recalling Sir Francis Galton’s pioneering work that introduced regression models as a tool for scientific inquiry. I then introduced the mechanics of regression models for those readers interested in the mathematical details. Later, I used examples from housing markets, consumer spending on food and alcohol, and the relationship between teaching evaluations and instructors’ looks to illustrate regression models in action.

All else being equal has been the focus of this chapter. I have demonstrated that only when I control for other influences can I obtain an accurate estimate of the influence of a given variable on the phenomenon or behavior of interest.

The height-earning example, described earlier in the chapter, reinforces the thesis statement in this chapter; that is, all else being equal, workers’ height does not influence their earnings. The regression model helped Schick and Steckel (2015) realize that height was correlated with higher cognitive and non-cognitive skills that reflected favorably in a person’s earning potential.

Regression models remain the workhorse of statistical analysis. They will continue to be a prized tool for data scientists who are interested in exploring the complex relationships hidden in large and small data sets.

ENDNOTES

1. Haider, M. (1999). “Development of Hedonic prices indices for freehold properties in the Greater Toronto Area, application of spatial autoregressive techniques.” Retrieved from https://tspace.library.utoronto.ca/handle/1807/12657

2. Hadhazy, A. (2015). “Will humans keep getting taller?” Web published on May 14, 2015. Retrieved July 16, 2015, from http://www.bbc.com/future/story/20150513-will-humans-keep-getting-taller

3. Matthews, T. J. and Hamilton, B. E. (2009). “Delayed childbearing: more women are having their first child later in life.” NCHS Data Brief, (21) 1–8.

4. Vandegrift, D. and Yoked, T. (2004). “Obesity rates, income, and suburban sprawl: An analysis of US states.” Health & Place, 10(3), 221–229.

5. Kulu, H., Boyle, P. J., Andersson, G., and others. (2009). “High suburban fertility: Evidence from four Northern European countries.” Demographic Research, 21(31), 915–944.

6. Kulu, H. and Boyle, P. J. (2008). “High Fertility in City Suburbs: Compositional or Contextual Effects?” European Journal of Population, 25(2), 157–174.

7. Repetto, R. (2013). Economic equality and fertility in developing countries. Routledge.

8. Schick, A. and Steckel, R. H. (2015). “Height, Human Capital, and Earnings: The Contributions of Cognitive and Noncognitive Ability.” Journal of Human Capital, 9(1), 94–115.

9. Varian, H. R. (2014). “Big Data: New Tricks for Econometrics” The Journal of Economic Perspectives: A Journal of the American Economic Association, 28(2), 3–27.

10. Batterham, A. M. and Greg, A. (2005). “How big does my sample need to be? A primer on the murky world of sample size estimation.” Physical Therapy in Sport. 6(3). http://doi.org/10.1016/j.ptsp.2005.05.004

11. http://econpapers.repec.org/software/bocbocode/s456416.htm

12. Wooldridge, J. M. (2012). Introductory Econometrics: A Modern Approach (Upper Level Economics Titles) (5th edition). Cengage Learning.

13. Hamermesh, D. S. and Parker, A. (2005). “Beauty in the classroom: Instructors’ pulchritude and putative pedagogical productivity.” Economics of Education Review, 24(4), 369–376.

14. http://www.ianwatson.com.au/stata/tabout_tutorial.pdf

15. A Guide to Econometrics by Peter Kennedy. Econometrics by Example by Damodar Gujarati. Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge.

Chapter 12. Data Mining for Gold

Does an extramarital affair lead to heartbreak? A brief answer is yes. A detailed response hints at more than one heartbreak. Typically, the faithful spouse is the one with the broken heart when an affair is discovered. However, the cheating spouse must also brace for heartbreak, literally. New research shows that the cheating spouse faces a higher risk of heart-related ailments than the faithful spouse does, and this is not all. The sexual prowess of the paramour (the third leg in the relationship stool) largely determines the extent of the cardiac discomfort to the cheating spouse. Research shows that the odds of a cheating spouse facing a Major Adverse Cardiovascular Event (MACE) are higher when the paramour displays strong sexual desire. Some, perhaps many, would still consider an affair worth the risk; the heart wants what the heart wants.

Ray C. Fair in 1978 published the seminal work on the economics of extramarital affairs.1 Since then, several studies have tried to determine why spouses have an affair or, more importantly, how prevalent affairs are among the married cohorts. Research in economics, sociology, and, as of late, in epidemiology has explored the dynamics of extramarital affairs. Of course, religious texts in the Abrahamic traditions have something to contribute on this matter. Thou shalt not covet thy neighbor’s wife has been a part of the religious doctrine. Sexual transgressions have been looked on unfavorably by the religious establishment of all persuasions. In some religions, the consequences could be more drastic. For instance, under the strictest interpretation of the Islamic jurisprudence, a married Muslim man found guilty of fornication could be sentenced to death by stoning. The risks indeed are not just to the heart, but also to life!

This chapter focuses on data mining, a term that describes a collection of statistical and machine-learning algorithms for discovering trends and patterns in data. Machine-learning algorithms are unique in that they learn from the data as they analyze it, modifying themselves in the process to improve the analysis. I subject data from Professor Fair’s research to data mining algorithms to determine what traits correlate with extramarital affairs (Fair, 1978). In particular, I would like to determine the impact of individuals’ religious beliefs on their propensity to engage in extramarital affairs. We already know that the Abrahamic traditions abhor a little extra indulgence. In this chapter, you will learn to determine:

• Whether those who take the gospel seriously are less or more likely to indulge in an extramarital affair

• Whether individuals satisfied with their marriage are less likely to have an affair

• Whether the duration of marriage or the presence of young children influences the likelihood of one having an extramarital affair

• Whether men and women differ in the propensity to have an extramarital affair

CAN CHEATING ON YOUR SPOUSE KILL YOU?

Extramarital affairs are as old as the institution of marriage. However, the narrative around affairs has been contextualized in religious interpretations. As a result, the nomenclature of extramarital affairs includes terms such as cheating, unfaithful, heartbreak, and sin. As if that were not enough, epidemiologists joined the fray by suggesting higher risks of heart ailments for unfaithful spouses.

The fact remains that most, if not all, societies attach a great premium to fidelity, and marriage is no exception. However, in spite of almost universal support for fidelity, we know that affairs do happen. What we do not know with certainty is the prevalence of affairs. The low estimates put the number below 5%. In an unpublished paper, Michael and others (1993),2 cited in Smith (2006), estimated that in a given year, 3%–4% of currently married people had sex with a partner besides their spouse. They further estimated that “about 15% to 18% of ever-married people have had a sexual partner other than their spouse while married.”3

Anderson (2006) reviewed 67 studies on paternity conflicts to conclude that approximately 1.9% of the men who were highly confident of being the father were not the biological father of the child.4 These men were ignorant of the fine details related to the paternity of the children they were raising as their own.

While the religious discourse on extramarital affairs has existed for several millennia, the literature on the cardiac well-being of those who cheat is recent. Fisher and others (2011) found a deleterious effect of extramarital affairs on the cardiovascular system. More than 8% of the respondents (married men) in their survey reported a stable secondary relationship. A follow-up a few years later revealed 95 incidences of Major Adverse Cardiovascular Events (MACE), of which eight were fatal. They found that men whose paramours did not experience a lack or absence of sexual desire were more likely to experience MACE. Put differently, “infidelity induces not only heart trouble in the betrayed partners, but seems to be also able to increase the betrayer’s heart-related events,” noted Fisher and others.

Are Cheating Men Alpha Males?

Some research suggests that men who cheat are psycho-biologically inclined to do so. Cheating males report significantly higher “testosterone levels and testis volume,” and a lower prevalence of lack of sexual desire and erectile dysfunction (Fisher et al., 2012).5 In fact, some research suggests that the characteristics common among cheating males are similar to those that qualify one as an alpha male: “As unfaithful subject seems to be an alpha male, a sort of super hero, with a better hormonal milieu and better vascular function.” Should women then be concerned about the fidelity of their alpha male partners?

When we dig deeper into the data, we realize that those who cheated on their spouses were in fact more likely to suffer from cardiovascular disease (CVD). At the beginning of the study, Fisher and others observed that the married men who reported a secondary stable relationship were about two times more likely to be suffering from a CVD. Could it be that the men engaging in extramarital affairs were predisposed to CVD, and that the causal effect of extramarital affairs on CVD might not hold?

UnFair Comments: New Evidence Critiques Fair’s Research

Since the publication of Fair’s analysis, some researchers, equipped with advanced statistical methods, have revisited the study. In particular, they question Fair’s findings about the effect of duration of marriage on the likelihood of having an extramarital affair. Fair found that the likelihood of an extramarital affair increases with the duration of marriage. Li and Racine (2004), using a different statistical method, concluded that the duration of marriage, in the presence of other covariates, was not a statistical predictor for individuals’ likelihood of engaging in extramarital affairs.6

Fair (1978) used two data sets that produced slightly different results. Of particular interest is the finding that the likelihood of an extramarital affair was higher for professionals than it was for blue-collar workers. At the same time, the likelihood was found to decline with education. Because both education and professional status are essentially proxies for income, the two variables returned counterintuitive results. This prompted Ian Smith (2012) to revisit the question with new data sets.7 He concluded that higher socioeconomic status resulted in a greater likelihood for extramarital affairs. At the same time, college education was found to be negatively correlated with extramarital affairs when the occupation status was held constant.

Before applying the data mining algorithms to Fair’s data set, I offer a brief introduction to select data mining concepts and apply them to Fair’s data set and another data set on precipitation and weather.

DATA MINING: AN INTRODUCTION

Data mining applications have become increasingly numerous with the advent of big data, inexpensive computing, and efficient data storage platforms. Both struggling and thriving retailers have turned to data mining to learn more about their customers’ habits and preferences in order to expand sales and minimize costs. Advanced algorithms can do a good job of predicting consumer needs.

Banks and financial institutions use data mining algorithms to identify fraudulent transactions. If you own a major credit card, you might have, at some point in the past, received a call from the financial institution inquiring about your recent transactions, which were flagged by their data mining algorithms as purchases unlikely to have been made by you. Even before you realized that your credit card had been fraudulently used by others, your financial institution intervened to prevent further losses from taking place.

At times, the analytics might venture too far afield, such that individuals feel their privacy is being breached. The infamous incident at a Target store in Minneapolis serves as a good example. Andrew Pole, a statistician working for the retail chain, developed a maternity prediction model by analyzing the recent purchases made by women. A year after the model went into operation, an angry father walked into the Target store in Minneapolis and showed the manager coupons and clippings about baby formula, cribs, and pregnancy clothes that the retail chain had sent to his teenage daughter. He was furious with the retailer, which he thought was trying to encourage his teenage daughter to become pregnant. A few weeks later, though, the father became aware of his daughter’s pregnancy, which had been kept from him. What the father didn’t know, Target did.8

The question then emerges about how much of predictive analytics is of benefit to society. Using the same data mining tools, insurance companies can refuse to enroll individuals who might be predisposed to diseases that could require expensive treatments. Even if one were ignorant of one’s likelihood of becoming ill, advanced algorithms can predict one’s future well-being from purchases and genetic predispositions. Such advanced analytics could serve to benefit individuals and societies by alerting those predisposed to future illnesses. For instance, Angelina Jolie, a leading Hollywood actress, has undergone proactive surgeries to remove tissues that could have caused cancer in the future.9 She did so because her genetic makeup predicted the likelihood of her becoming ill in the future. Yet low-income individuals, who often have no or inadequate health insurance, could be denied health insurance later because analytics might predict higher odds of insurance claims.

There is, though, a limit to what data mining can reveal. Despite the advances being made in statistical analysis and algorithms, predictive analytics is still far from being a perfect science. While Target could identify a pregnant teenager much sooner than her father did, that capability did not prevent the same retailer from losing billions of dollars in Canada. In January 2015, the giant retailer suddenly disclosed several billion dollars in losses after operating in Canada for less than two years. Target projected that it could not become profitable for another six years and thus decided to close its 133 stores in Canada.10

What went wrong at Target? A lot. However, the most important thing to remember is that no amount of analytics and predictive modeling can come even close to replacing sound business management and execution. Generating insights from analytics and data mining is one thing; executing the plan efficiently is completely another.

Let’s begin with a discussion of what is meant by data mining. There are two divergent approaches to data mining. One is the traditionalist approach of considering data mining synonymous with statistical analysis. The other group insists that it differs from the traditional statistical analysis approach and leans toward unsupervised, machine-learning algorithms. I consider both approaches correct. The debate is somewhat superficial because it focuses on methods rather than objectives. Data mining should be taken at face value: an attempt to explore hitherto unknown trends and insights by subjecting data to analysis. The search for the “models” hidden in the data should not be restricted to certain methods.

Statistical methods rely on probability distributions whereas machine-learning methods are based on algorithms. Ultimately, the two approaches help us learn new facts about the phenomenon we study. The methods used should be flexible enough for us to discover not only what we are searching for, but also trends and facts that might not be on our radar. The known unknowns are rather easier to find than the unknown unknowns. Data mining should be equally good for both.

SEVEN STEPS DOWN THE DATA MINE

Ultimately, we analyze data to gain insights that can help us make smart decisions. Fong and others (2002) offer a seven-step approach to data mining to support smart decision-making.11 The seven steps are:

1. Establish data mining goals

2. Select data

3. Preprocess data

4. Transform data

5. Store data

6. Mine data

7. Evaluate data mining results

The sections that follow describe the seven steps in more detail.

Establishing Data Mining Goals

The first step in data mining requires you to set up goals for the exercise. Obviously, you must identify the key questions that need to be answered. However, going beyond identifying the key questions are the concerns about the costs and benefits of the exercise. Furthermore, you must determine, in advance, the expected level of accuracy and usefulness of the results obtained from data mining. If money were no object, you could throw as much money at the problem as necessary to get the answers required. However, the cost-benefit trade-off is always instrumental in determining the goals and scope of the data mining exercise. The level of accuracy expected from the results also influences the costs: higher accuracy costs more, and vice versa. Furthermore, beyond a certain level of accuracy, you do not gain much from the exercise, given diminishing returns. Thus, the cost-benefit trade-offs for the desired level of accuracy are important considerations for data mining goals.

Selecting Data

The output of a data mining exercise largely depends upon the quality of the data being used. At times, data are readily available for further processing. For instance, retailers often possess large databases of customer purchases and demographics. On the other hand, data may not be readily available for data mining. In such cases, you must identify other sources of data or even plan new data collection initiatives, including surveys. The type of data, its size, and the frequency of collection have a direct bearing on the cost of the data mining exercise. Therefore, identifying the right kind of data, which can answer the questions at a reasonable cost, is critical.

Preprocessing Data

Preprocessing data is an important step in data mining. Often raw data are messy, containing erroneous or irrelevant data. In addition, even with relevant data, information is sometimes missing. In the preprocessing stage, you identify the irrelevant attributes of data and expunge such attributes from further consideration. At the same time, identifying the erroneous aspects of the data set and flagging them as such is necessary. For instance, human error might lead to inadvertent merging or incorrect parsing of information between columns. Data should be subject to checks to ensure integrity. Lastly, you must develop a formal method of dealing with missing data and determine whether the data are missing randomly or systematically.

If the data were missing randomly, a simple set of solutions would suffice. However, when data are missing in a systematic way, you must determine the impact of the missing data on the results. For instance, a particular subset of individuals in a large data set may have refused to disclose their income. Findings relying on an individual’s income as an input would exclude those individuals whose income was not reported, leading to systematic biases in the analysis. Therefore, you must consider in advance whether observations or variables containing missing data should be excluded from the entire analysis or from parts of it.
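For data sets that end up in Stata (the software used in earlier chapters), a quick way to gauge the extent and pattern of missing values before deciding on a strategy is misstable; a minimal sketch:

* Sketch: report how many values are missing and in which combinations of variables
misstable summarize
misstable patterns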

Transforming Data

After the relevant attributes of data have been retained, the next step is to determine the appropriate format in which the data must be stored. An important consideration in data mining is to reduce the number of attributes needed to explain the phenomena. This may require transforming data. Data reduction algorithms, such as Principal Component Analysis (demonstrated and explained later in the chapter), can reduce the number of attributes without a significant loss of information. In addition, variables may need to be transformed to help explain the phenomenon being studied. For instance, an individual’s income may be recorded in the data set as wage income, income from other sources (such as rental properties), support payments from the government, and the like. Aggregating income from all sources yields a representative indicator of the individual’s income.

Often you need to transform variables from one type to another. It may be prudent to transform the continuous variable for income into a categorical variable in which each record in the database is identified as a low-, medium-, or high-income individual. This could help capture non-linearities in the underlying behaviors.
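As a sketch in Stata (used in earlier chapters), a continuous income variable could be recoded into categories along the following lines; the variable name and cutoffs are purely illustrative:

* Sketch: recode a continuous income variable into low / medium / high categories (name and cutoffs are illustrative)
gen income_cat = .
replace income_cat = 1 if income < 400
replace income_cat = 2 if income >= 400 & income < 800
replace income_cat = 3 if income >= 800 & !missing(income)
label define inc_lbl 1 "low" 2 "medium" 3 "high"
label values income_cat inc_lbl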

Storing Data

The transformed data must be stored in a format that makes it conducive for data mining. The data must be stored in a format that gives unrestricted and immediate read/write privileges to the data scientist. During data mining, new variables are created, which are written back to the original database, which is why the data storage scheme should facilitate efficiently reading from and writing to the database. It is also important to store data on servers or storage media that keeps the data secure and also prevents the data mining algorithm from unnecessarily searching for pieces of data scattered on different servers or storage media. Data safety and privacy should be a prime concern for storing data.

Mining Data

After data is appropriately processed, transformed, and stored, it is subject to data mining. This step covers data analysis methods, including parametric and non-parametric methods, and machine-learning algorithms. A good starting point for data mining is data visualization. Multidimensional views of the data using the advanced graphing capabilities of data mining software are very helpful in developing a preliminary understanding of the trends hidden in the data set.

Later sections in this chapter detail data mining algorithms and methods.

Evaluating Mining Results

After results have been extracted from data mining, you do a formal evaluation of the results. Formal evaluation could include testing the predictive capabilities of the models on observed data to see how effective and efficient the algorithms have been in reproducing data. This is known as an in-sample forecast. In addition, the results are shared with the key stakeholders for feedback, which is then incorporated in the later iterations of data mining to improve the process.
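
The sketch below shows the idea of an in-sample check with a hypothetical linear model; y, x1, x2, and df are placeholder names.

fit <- lm(y ~ x1 + x2, data = df)   # estimate the model on the observed data
head(fitted(fit))                   # in-sample predictions for the estimation sample
summary(fit)$r.squared              # share of variance the model reproduces in-sample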

Data mining and evaluation thus become an iterative process: in light of the feedback received from key stakeholders, analysts deploy better and improved algorithms to raise the quality of the results generated.

RATTLE YOUR DATA

In this section, I illustrate data mining concepts using a specialized Graphical User Interface (GUI) for data mining, Rattle, that runs the algorithms programmed in R. This chapter differs from others in the book because here I rely on the point-and-click functionality of the GUI, whereas in other chapters I displayed the command line interface for R and other software.

Graham Williams is the programmer behind Rattle. He has authored a book, Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, which explains the functionality of Rattle.12 I strongly recommend the text for those interested in applied data mining. Rattle can be installed free from within R. Details on installation and other supporting materials are available from Graham Williams at http://www.togaware.com/.
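
At the time of writing, Rattle is available on CRAN, so a typical installation looks like the following; depending on your platform, additional GUI toolkit libraries may be requested during setup.

install.packages("rattle")   # install Rattle and its R dependencies from CRAN
library(rattle)              # load the package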

Rattle has been designed with data mining in mind. After the GUI is up and running, a series of tabs laid out in sequence from left to right appears (see Figure 12.1). Note that the first tab, Data, offers options to import data into Rattle from a variety of data formats. Other options available from the Data tab include partitioning the data for estimation and testing purposes. Partitioning allows analysts to estimate the model on a subset of the data and then apply the model to the unused part of the data to test how well it performs for “out-of-sample” forecasting. The same tab displays a data dictionary offering details about variable names and types (continuous or numeric versus categorical), along with options for identifying the dependent variable as “Target” and excluding variables using the “Ignore” option. The tab also identifies how many unique values each variable contains. You can also change the type of the Target variable.

Figure 12.1 Rattle graphical user interface displaying data attributes for the affairs data set

Rattle follows a logical order to data mining. After the data has been imported into Rattle and converted to R’s native format, the next step involves exploring the data set using the Explore tab. This could entail reporting summary statistics and generating insightful charts. Afterward, the analyst may want to test the relationship between variables using statistical tests, which are available from the Test tab. At this stage, the data scientist might want to transform variables to see whether different types of tests could be conducted. Data transformation options are available under the Transform tab. The following three tabs, Cluster, Associate, and Model, offer functionality for advanced data mining methods. The last tab, Log, maintains a sequential log of the commands executed in the session. You can download the log file to edit it for reproducing the work later.

To launch Rattle, I first launch R and type:

library(rattle)   # load the Rattle package
rattle()          # launch the Rattle GUI

As mentioned earlier, Rattle launches in the Data tab. If I were to click the Execute button, Rattle would offer the option to launch the practice data set. Otherwise, you can opt to select your own data. I illustrate the data mining concepts using the Affairs data set, which you can download from the book’s website at www.ibmpressbooks.com/title/9780133991024.13
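
If you prefer to load the data outside the GUI, a sketch like the following works, assuming you have saved the downloaded file locally; the file name affairs.csv is hypothetical.

affairs <- read.csv("affairs.csv", stringsAsFactors = TRUE)  # import the CSV as a data frame
str(affairs)                                                 # inspect variable names and types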

Also note that if Rattle requires a particular package that does not exist in your R library, it will offer the option and instructions to download and install the package before proceeding with the analysis.

What Does Religiosity Have to Do with Extramarital Affairs?

In this section, I illustrate the use of data mining techniques using one of the data sets analyzed by Ray C. Fair in 1978. Remember, you can download the data from the book’s website from the material listed under Chapter 12. Table 12.1 describes the data set.

Table 12.1 Affairs Data Set

You move from the Data tab to the Explore tab to compute summary statistics for the selected variables by selecting the Summary radio button, checking the Summary box in the options listed, and clicking the Execute button to run the command. Figure 12.2 shows the results. These results are similar to those obtained with R’s summary() command. Note that frequency tabulations are presented for categorical variables (gender), and means and other distributional statistics are presented for continuous variables. Also note that the data set treats education, occupation, rating, religiousness, yearsmarried, and affairs as continuous variables. However, these could also be treated as factor or discrete variables.

Figure 12.2 Summary statistics using the Explore tab
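
For reference, similar summaries can be produced at the R prompt, assuming the data frame is named affairs.

summary(affairs)                             # distributional statistics and factor counts
summary(as.factor(affairs$religiousness))    # treat a numeric variable as discrete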

Cross-tabulations are also available under Summary. Running a cross tabulation between gender and affairs produces the output shown in Figure 12.3. Note that the output presents the count, the expected count, the chi-square contribution, and values presented as a percent of the row, column, and table total. Thus, you can see that 77% of the females compared to 72.7% of the males reported having no affair. At the same time, the same proportion of males and females (6.3%) reported having more than 10 affairs in the past 12 months. I can also report from Figure 12.3 that 53.9% of those who did not report an affair were women in the sample.

Figure 12.3 Cross tabulation between gender and the number of affairs
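
A comparable cross tabulation can be run at the R prompt; the variable names are assumed to match the Affairs data set described in Table 12.1.

tab <- table(affairs$gender, affairs$affairs)   # counts of affairs by gender
round(prop.table(tab, 1) * 100, 1)              # row percentages
chisq.test(tab)                                 # chi-square test of independence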

The Explore tab offers comprehensive functionality for presenting results as graphics. To activate the graphing capabilities, you click the Distributions radio button. You can select the number of plots to be presented on the same page. A different set of plots is available for continuous and categorical variables. Figure 12.4 shows selections for four plots per page and four variables for histograms. Figure 12.5 shows the resulting histograms.

Figure 12.4 Dialog box for generating graphics

The histogram in the top-left corner presents the distribution of yearsmarried. The most frequently occurring cohort comprises those married for 12 years or more. The histogram for religiousness indicates that anti-religion individuals constituted a small minority in the data set. Most respondents reported being either somewhat or very religious. The distribution of the education variable suggests that “some college education” was the most frequent response. At the same time, a majority of the respondents reported at least a college degree. Lastly, most respondents were happy or very happy with their marriage, as is evident from the histogram in the bottom-right quadrant in Figure 12.5.

Figure 12.5 Histograms plotted for four continuous variables
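
Comparable histograms can also be drawn directly with base R, assuming the affairs data frame is loaded.

par(mfrow = c(2, 2))                                          # four plots per page
hist(affairs$yearsmarried, main = "Years married", xlab = "")
hist(affairs$religiousness, main = "Religiousness", xlab = "")
hist(affairs$education, main = "Education", xlab = "")
hist(affairs$rating, main = "Marriage rating", xlab = "")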

R’s unique graphing capabilities are also available with Rattle. The traditional approach to exploring correlation between variables is to generate a correlation matrix depicting the Pearson correlation statistics. R and Rattle can do better: the correlation matrix can instead be presented as a graphic. Figure 12.6 shows the correlation matrix for select variables. The upward sloping lines running along the diagonal represent the perfect correlation between each variable and itself. The white circles and ovals represent weak or no correlation between variables. The downward sloping, gray shaded ovals represent negative correlation. As the correlation strengthens, the shades become darker and the shapes narrower. Positive correlation is represented by upward sloping, gray shaded ovals. R by default differentiates negative and positive correlations with different colors.

Figure 12.6 A graphical distribution of the correlation matrix

A strong negative correlation exists between affairs and the marriage satisfaction rating, represented by a downward sloping, dark shaded oval. The lighter shades and wider shapes for the correlation between affairs and religiousness/education imply that relatively weaker correlations exist between these variables. Occupation type, age, and the duration of marriage do not show a strong correlation with affairs. At the same time, a strong positive correlation exists between age and years of marriage. Similarly, a negative correlation exists between the duration of marriage and satisfaction with the respondent’s marriage.
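
A similar graphic can be produced at the R prompt; the sketch below uses the corrplot package, one of several packages offering this kind of display, and assumes it is installed.

library(corrplot)
num_vars <- affairs[, c("affairs", "age", "yearsmarried",
                        "religiousness", "education", "rating")]
corrplot(cor(num_vars), method = "ellipse")   # darker, narrower ellipses indicate stronger correlations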

The Principal Components of an Extramarital Affair

It is now quite common to work with data sets containing not only a very large number of records but also a large number of attributes or variables. Furthermore, several variables may serve as proxies for a phenomenon that cannot be readily measured, such as aptitude. Such situations might require two interventions: reducing the number of variables being used in the analysis, and computing one or more new variables from the existing ones that essentially serve as proxies for the hard-to-measure variable. Data scientists can turn to Principal Component Analysis (PCA) for this task.

Essentially, PCA generates a new set of variables, called principal components or eigenvectors, that explain almost all of the variance explained by the original variables. The first eigenvector is the one that explains the most variance in the data set; no combination of the original variables can explain more of the total variance. The subsequent eigenvectors are in descending order of their ability to explain the total variance. Each eigenvector is a linear combination of the original variables that resembles a regression equation without an intercept.

Assume that we perform PCA on 10 variables, which were normalized to mean = 0 and variance = 1. Let’s assume that the total variance explained by the first eigenvector is 4.0. Because the total number of variables is 10, we can calculate (4/10) * 100 to conclude that the first eigenvector explains 40% of the variance in the data set. Stated otherwise, the first eigenvector explained as much variance as four of the original variables in the data set did.

The amount of the total variance explained by an eigenvector is known as its eigenvalue. Each subsequent eigenvector maximizes the amount of remaining variance, which implies that the variance explained by the second or any later eigenvector is independent of, or uncorrelated with, that explained by the first. This allows us to add the variances explained by the eigenvectors to determine how much variance they explain cumulatively. We can also say that the eigenvectors are orthogonal to each other. The number of eigenvectors required to explain the total variance is called the rank.

Because PCA is considered a data reduction technique, it is useful to have an idea of how many eigenvectors to retain. Replacing the original variables with an equal number of eigenvectors would defeat the very purpose of data reduction. Data scientists might instead retain only as many eigenvectors as are needed to satisfy an arbitrary threshold for explained variance, such as 75%.

Interpreting the eigenvectors may at times be difficult. You can try interpreting them by reviewing the factor loading coefficients, which are similar to the coefficients estimated by the regression models for the corresponding variables. The coefficients report the correlation between a given eigenvector and the corresponding variable. The correlation could be either positive or negative. The informal general rule is to consider a factor loading of 0.3 or more as substantial. A factor loading of 0.3 suggests that the variable and the eigenvector share 9% (0.3² * 100) of their variance.

For additional details on PCA, consult Jolliffe (2014).14

Will It Rain Tomorrow? Using PCA For Weather Forecasting

Because PCA is suited for continuous variables, let’s now turn to the weather data set that comes bundled with the Rattle GUI. Recall that most variables in the affairs data set are categorical and hence not suited for PCA. This example shows running PCA on 14 variables that report atmospheric conditions, such as temperature, humidity, wind speed, evaporation, sunshine, and the like. The data set is readily available from within the R environment. You can learn more about the data set from within R and Rattle. I have made it available from the book’s website in other software formats. Figure 12.7 presents the first 10 eigenvectors and the factor loadings corresponding to the 14 variables.

Figure 12.7 Output from PCA, factor loadings

What is more interesting is the proportion of variance explained by each eigenvector. Figure 12.8 shows the results. Note that the first eigenvector explains 38.5% of the variance. Together, the first six eigenvectors, PC1 to PC6, account for 90.3% of the variance explained by the 14 original variables. You can substitute these eigenvectors for the original variables. The added advantage is that the eigenvectors are not correlated, which implies that by using them as explanatory variables in a regression model, you would not need to be concerned with issues related to multicollinearity.

Figure 12.8 Proportion of variance explained by eigenvectors
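
For readers who prefer the command line, a minimal sketch with prcomp on the same bundled weather data produces comparable output; the filtering steps are assumptions made to keep the example self-contained.

library(rattle)
data(weather, package = "rattle")                       # load the bundled weather data set
num_vars <- weather[, sapply(weather, is.numeric)]      # keep only the numeric variables
num_vars <- num_vars[complete.cases(num_vars), ]        # drop rows with missing values
pca <- prcomp(num_vars, center = TRUE, scale. = TRUE)   # standardize, then extract components
summary(pca)                                            # proportion of variance explained
round(pca$rotation[, 1:3], 2)                           # factor loadings for the first three components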

Do Men Have More Affairs Than Women?

The Test tab provides options to conduct several statistical tests. Assuming that the affairs variable is continuous, in this example I conduct a t-test to determine whether men on average have more affairs than women do. Figure 12.9 presents the results, where x represents women and y represents men. The t-statistic of –0.2873 suggests that the average number of affairs does not differ statistically between men and women. The output presents results for equal and unequal variances. In addition, the p-values for the three possible alternatives, namely equal, less than, and greater than, are also reported.

Figure 12.9 t-test to test whether men on average have more affairs than women do
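
The same comparison can be run at the R prompt; t.test() reports the Welch (unequal variance) version by default.

t.test(affairs ~ gender, data = affairs)                     # unequal variances (Welch)
t.test(affairs ~ gender, data = affairs, var.equal = TRUE)   # assuming equal variances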

Two Kinds of People: Those Who Have Affairs, and Those Who Don’t

Robert Benchley, an American humorist, once said, “There are two kinds of people in the world, those who believe there are two kinds of people in the world, and those who don’t.” I also believe that the universe of married individuals can be divided into two groups: those who have extramarital affairs and those who do not. The challenge is to determine how to divide the married population into two or more groups.

The reality is that when it comes to marital infidelity, there are more than two groups or clusters of people. Fortunately, data mining offers several options to cluster observations into homogenous groups. One such technique, and not necessarily the best, is K-means clustering. For additional details, see Everitt & Hothorn (2011, page 175).15 The clustering options are available under the Cluster tab in Rattle.

To illustrate the clustering example, I select a subset of variables from the Cluster tab. I want to know what kind of demographic and other attributes are correlated with the propensity to have affairs. Using the options available in the dialog box, I select five clusters so that each observation in the data set belongs to one of the five clusters. Figure 12.10 presents the results.

Figure 12.10 K-mean clustering of affairs data

The output is structured as follows. Cluster sizes report the number of observations belonging to each cluster. The largest cluster contains 141 observations, and the smallest 27 observations. The output also reports the overall average values for each variable: the average number of affairs is 1.3 and the average age is 32.75. The third set of output reports the average statistics for each cluster. The second cluster reports the highest average value for affairs at 9.96. Individuals falling in this affair-happy category have an average age of 38.3 years, an average marriage duration of 14.3 years, and an average religiousness of 2.81. At the same time, individuals belonging to this cluster rate their marriages much lower than respondents belonging to other clusters.

Cluster 3 reports the lowest average value for affairs at 0.39. Note that the average religiousness for these individuals is 3.38, which is also the second highest average value for religiousness for any cluster. Thus, we can see that individuals clustered together depicting the lowest propensity for affairs also report a higher degree of religiousness.

To view the output from the clustering algorithm as a graphic, see Figure 12.11, which shows the five clusters scattered in a two-dimensional space. You can see that observations belonging to a particular cluster are bunched together.

Figure 12.11 Graphical display of clusters
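
The clustering can be reproduced at the R prompt with kmeans(); the variable selection, the seed, and the choice of five clusters follow the example in the text and are otherwise arbitrary.

vars <- scale(affairs[, c("affairs", "age", "yearsmarried",
                          "religiousness", "rating")])       # standardize the inputs
set.seed(42)                                                 # make the cluster assignment reproducible
km <- kmeans(vars, centers = 5, nstart = 25)
km$size                                                      # observations in each cluster
aggregate(affairs[, c("affairs", "age", "yearsmarried", "religiousness", "rating")],
          by = list(cluster = km$cluster), FUN = mean)       # cluster means on the original scale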

Models to Mine Data with Rattle

Rattle offers a variety of modeling alternatives under the Model tab (see Figure 12.12). This section covers two commonly used data mining algorithms, namely Decision Trees and Neural Networks.

Figure 12.12 Modeling options available with Rattle

Applying Decision Trees to the Affairs Data

Rattle fits Decision Trees using binary recursive partitioning. The data are successively partitioned on explanatory variables such that two groups emerge around a certain threshold for each variable: one group contains observations below the threshold and the other observations above it. The splitting continues until no variable can further partition the data, given the homogeneity of the remaining observations. For additional details on decision trees, you can review Quinlan (1986).16

I apply the Decision Trees model to the Affairs data. The dependent variable is affairs. I use the Conditional option for Decision Trees. The output in Rattle identifies the dependent and explanatory variables along with the empirical output for the nodes in the estimated Decision Tree (see Figure 12.13). Clicking Draw presents the results graphically.

Figure 12.13 Decision Tree output for the Affairs data set

The first node in Figure 12.13 splits the data along the rating variable such that those who were not satisfied with their marriage and rated it less than or equal to two fall into one category, and those who were satisfied with their marriage and rated it greater than two fall into a separate category. Those who are unsatisfied with their marriage are not further categorized by other variables. For them, the results are presented in Node 2, which contains 58 respondents whose average number of reported affairs is above six. Thus, we can see that satisfaction with one’s marriage is an important determinant of extramarital affairs.

Those who rate their marriage higher than two see another split along religiousness. Those who are less religious, rating their religiousness as less than or equal to three, fall into one group and the rest into the other. Those who are more religious are further subdivided along age into two groups: those less than 42 years old and those greater than 42 years old. The two nodes under age do not appear to differ much; however, there is greater dispersion in the number of affairs under Node 9 than under Node 8.

Those respondents who are less religious are further split along the duration of marriage. Those with a marriage duration of more than seven years have a relatively higher value for reported affairs than those whose marriage duration was less than seven years.

From the Decision Tree in Figure 12.13, you can conclude the following. The most important determinant of having an affair is how one rates one’s marriage: those who are less satisfied with their marriage are more likely to have extramarital affairs than those who are satisfied. Also, those who are more religious are less likely to have an extramarital affair. Furthermore, those who are less religious and have been married for more than seven years are also more likely to have extramarital affairs.
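
A comparable tree can be fitted directly at the R prompt; the sketch below assumes the party package, whose ctree() function fits conditional inference trees similar to those produced by Rattle’s Conditional option.

library(party)
fit <- ctree(affairs ~ gender + age + yearsmarried + children +
               religiousness + education + occupation + rating,
             data = affairs)
fit          # print the nodes and split rules
plot(fit)    # draw the tree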

Neural Networks of Extramarital Affairs

Neural Networks are another very popular data-mining tool. They are commonly regarded as a black box because you can specify the inputs and evaluate the outputs of the model, yet have limited insight into its inner workings. Still, Neural Networks have gained greater acceptance as a data-mining tool in recent years. For additional details on Neural Networks, I recommend Rojas (2013).17

This example uses the binary variable bin.af as the dependent variable and other sociodemographic characteristics as explanatory variables. Figure 12.14 shows the output from the Neural Network model.

Figure 12.14 Output from the Neural Network model
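
A comparable network can be fitted at the R prompt with the nnet package; the construction of the binary indicator bin.af, the network size, and the tuning values below are assumptions for illustration.

library(nnet)
affairs$bin.af <- as.factor(affairs$affairs > 0)          # one or more affairs versus none
set.seed(42)
nn <- nnet(bin.af ~ gender + age + yearsmarried + children +
             religiousness + education + occupation + rating,
           data = affairs, size = 10, decay = 0.01, maxit = 200)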

How well is the Neural Network model performing? It is hard to tell from the output in Figure 12.14. The output does not report any goodness-of-fit statistics to tell you how well the model fits or which explanatory variables are good predictors of the dependent variable.

The preferred way of evaluating the performance of a Neural Network model is to see how accurately it predicts the outcome. For comparison, I also estimate a binary logit model with the same inputs. Figure 12.15 shows the output from the binary logit model.

Figure 12.15 Output from a binary logit model for the dichotomous variable, bin.af

Figure 12.15 shows that the three statistically significant variables are the duration of marriage, the respondent’s religiousness, and marriage satisfaction. The output shows that the odds of an affair increase with the duration of marriage, but decline with religiousness and marital bliss. Also, the overall model fit, presented as the pseudo R-square, is 0.27. Please consult Chapter 8 for details on how to interpret a binary logit model.
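
The comparison logit model can be estimated with glm(); the specification below mirrors the Neural Network inputs and assumes bin.af has been created as above.

logit <- glm(bin.af ~ gender + age + yearsmarried + children +
               religiousness + education + occupation + rating,
             data = affairs, family = binomial(link = "logit"))
summary(logit)        # coefficients and significance
exp(coef(logit))      # odds ratios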

The Evaluate tab in Rattle offers the functionality to test how well the model fits the data and how accurately it predicts outcomes for the unused segment of the data set that was set aside for validation. Figure 12.16 presents the results.

Figure 12.16 Error matrix for Neural Network and binary logit models

The first set of results in Figure 12.16 corresponds to the binary logit model where a 2 × 2 matrix of actual and predicted observations is presented. The binary logit model has done a good job of correctly categorizing all respondents who did not have an affair. However, the binary logit model erroneously categorized 22 out of the 26 respondents who had an affair as ones who did not have an affair. The overall error is reported as 0.244.

The output from the Neural Network model is shown in the bottom part of Figure 12.16. The Neural Network model wrongly categorizes 9 out of 64 respondents who did not have an affair. It was just as poor as the binary logit model at categorizing those who had an affair. The overall error reported for the Neural Network model is higher than that for the binary logit model.

This comparison shows that machine-learning algorithms might not necessarily outperform the traditional parametric models. At the same time, the added advantage of using parametric models is that you can learn from the inner workings of the model, which is not possible with the black box approaches.
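
An error matrix like the one in Figure 12.16 can also be constructed by hand at the R prompt; the sketch below holds out a 30 percent validation sample, which is an assumption rather than Rattle’s default split, and again assumes bin.af exists.

set.seed(42)
test_rows <- sample(nrow(affairs), size = floor(0.3 * nrow(affairs)))
train <- affairs[-test_rows, ]
test  <- affairs[test_rows, ]
fit <- glm(bin.af ~ gender + age + yearsmarried + religiousness + rating,
           data = train, family = binomial)
pred <- predict(fit, newdata = test, type = "response") > 0.5   # classify at the 0.5 threshold
table(actual = test$bin.af, predicted = pred)                   # error (confusion) matrix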

SUMMARY

The discourse in this chapter is intended only to be an introduction to the rich and rapidly evolving field of data mining. The topics covered constitute a very small subset of techniques and models being used for data mining purposes today.

Over the next few years, data mining algorithms are likely to experience a revolution. The availability of large data sets, inexpensive data storage, and advances in computing platforms are all set to change the way we go about data analysis. It is quite possible that the contents of this chapter will be dated in the next few years, or perhaps months. Such a rapid pace of development and innovation holds great promise for data scientists. Advances in computing and storage are also making analytics less expensive, and open source platforms are democratizing the landscape, giving talented individuals scattered across the globe the opportunity to put their skills to work.

I am confident that with big and small data, advances in computing, and open source platforms, the analytics community will break new ground in finding cures for diseases and devising strategies to achieve a more equitable world where resources and riches are shared among all.

ENDNOTES

1. Fair, R. C. (1978). “A theory of extramarital affairs.” The Journal of Political Economy, 45–61.

2. Michael, R. T., Laumann, E. O., and Gagnon, J. H. (1993). The Number of Sexual Partners in the U.S.

3. Smith, T. W. (2006). American Sexual Behavior: Trends, Socio-Demographic Differences, and Risk Behavior. GSS Topical Report No. 25. National Opinion Research Center, University of Chicago.

4. Anderson, K. G. (2006). How Well Does Paternity Confidence Match Actual Paternity? Evidence from Worldwide Nonpaternity Rates. Current Anthropology, 47(3), 513–520.

5. Fisher, A. D., Bandini, E., Corona, G., and Monami, M. (2012). Stable extramarital affairs are breaking the heart. International Journal of Andrology, 35(1), 11–17.

6. Li, Q. I. and Racine, J. (2004). “Predictor relevance and extramarital affairs.” Journal of Applied Economics, 19(4), 533–535.

7. Smith, I. (2012). “Reinterpreting the economics of extramarital affairs.” Review of Economics of the Household, 10(3), 319–343.

8. Duhigg, C. (2012, February 16). How Companies Learn Your Secrets. The New York Times. Retrieved from http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html.

9. Pitt, A. J. (2015, March 24). Angelina Jolie Pitt: Diary of a Surgery. The New York Times. Retrieved from http://www.nytimes.com/2015/03/24/opinion/angelina-jolie-pitt-diary-of-a-surgery.html.

10. Shaw, H. (2015, January 15). Target Corp to exit Canada after racking up billions in losses. The Financial Post. Retrieved from http://business.financialpost.com/news/retail-marketing/target-corp-calls-it-quits-in-canada-plans-fair-and-orderly-exit.

11. Fong, A. C. M., Hui, S. C., and Jha, G. (2002). “Data mining for decision support.” IT Professional, 4(2), 9–17. http://doi.org/10.1109/MITP.2002.1000455

12. Williams, G. (2011). Data mining with Rattle and R: the art of excavating data for knowledge discovery. Springer Science & Business Media.

13. https://sites.google.com/site/econometriks/

14. Jolliffe, I. (2014). “Principal Component Analysis.” In Wiley Stats Ref: Statistics Reference Online. John Wiley & Sons, Ltd.

15. Everitt, B. and Hothorn, T. (2011). An introduction to applied multivariate analysis with R. Springer Science & Business Media.

16. Quinlan, J. R. (1986). “Induction of Decision Trees.” Machine Learning, 1(1), 81–106.

17. Rojas, R. (2013). Neural networks: a systematic introduction. Springer Science & Business Media.
