Friday, April 27, 2018

Sex, Lies, and Data Profiles

The title of this blog could have been the title of Seth Stephens-Davidowitz's book Everybody Lies. As he explained in his interview with Freakonomics' Stephen Dubner, he knew the title was inaccurate, but he was told that "98% of People Lie" wouldn't sell well. In case you were wondering about the Cretan or Liar's Paradox implied by the title he opted for, he assures the interviewer that he is among the 2% of honest people. But didn't he then lie in the title? And is he really as honest as he claims? This blog examines the second question.

Ultimately, what makes both data reports from people presented as experts and data visualizations so effective at conveying a point is that they don't require much analysis on the viewer's end: they've already done the thinking for you. That's both seductive and potentially misleading.

That's exactly why we have to be careful not to merely accept the visually expressed story at face value. Any data visualization should be subjected to a triple C test with a check for context, correlation, and causation, as I wrote about for Baseline here.

In light of how openly political media and companies that handle data have grown, there's clearly a need for a few more C words to keep in mind when presented with what is offered as objective data:

  • Correspondence to reality. Just because someone claims expertise doesn't mean they are completely correct in their assertions. For example, when I was in labor with my first baby, the doctors and nurses at the hospital dismissed my pains, claiming the contractions were "mild" and that the birth was far from imminent. I was not the expert; they were, but I knew that I felt the baby coming. As it turned out, the resident barely got to me in time. I learned from that experience that you should not be gaslit by expert views that directly contradict not just what you think you know but what you do know and directly experience.
  • Convenience: This pertains to both means and ends. Convenience of means refers to using the data that is on hand or easily measured, even if it's not necessarily the most relevant data. It's rather like measuring how much snow fell on your windowsill because it's easy to reach, rather than going outside to measure it on the street and in the drifts for a more accurate reading. Convenience of ends is about selecting data that you can easily fit into the conclusion you wish to draw, a.k.a. cherry-picking.
  • Confirmation Bias: In general, when you look for data on something, you have to bear in mind that absolute objectivity is rare. Many of us have deep-seated values and beliefs that will not allow us to entertain the possibility that we are on the wrong track, which skews our results through what we allow and disallow in the data set. It is the equivalent of painting a bull's eye around where your arrow landed. So ask yourself: does the person have some personal agenda that could be coloring the outcome? If so, you should treat their findings with the same healthy skepticism you would apply to tobacco studies sponsored by cigarette companies.
  • Certainty Camouflaging Contingencies: Few things are absolutes, so if someone states something without qualifiers, something is likely being hidden or glossed over -- like the fact that the data is out of date, or that searches for racist terms and jokes are being taken as proxies for the searcher being a racist, with labels then shifted from what is actually measured to what the person says the measurement signifies. This leads to a triple F: Fudging Figures and Facts.

Incidentally, Seth Stephens-Davidowitz takes no chances that you won't recognize him as an expert. Right on p. 1, he declares, "I am an internet data expert." I don't make any such claim, though I have been delving into questions of big data since 2011 and regularly review data science student work. But unlike Stephens-Davidowitz, I didn't work for Google. It was actually a team from Google that originally inspired me to write up the piece on not believing everything you see in data visualizations.

The data visualization at the beginning of Everybody Lies (p. 13) presents two maps of the US that are intended to show a correlation that implies causation. Stephens-Davidowitz refers to having researched correlations between racist searches and voting patterns to argue that Obama lost votes to racism. However, the argument he makes about Trump right at the beginning of his book is actually based on an assertion that Nate Silver made in a tweet in early 2016.
Nate Silver's tweet cites an article written by a different Nate, with the last name Cohn, to bolster his claim. So I went to his source: a New York Times article by Nate Cohn published on December 31, 2015, Donald Trump's Strongest Supporters: A Certain Kind of Democrat, and there are the maps that appear in Stephens-Davidowitz's book.




The maps juxtaposed to indicate correlation and imply causation in the article reappear (in grayscale) in Everybody Lies. In case the caption appears too small for you to read, I'll put it in text: "Source: Vote estimates by Congressional district provided by Civis Analytics; Google search estimates from 2004-7 by Seth Stephens-Davidowitz." How convenient! Stephens-Davidowitz already had that data set from when he gathered it to present evidence of racism at the time of Obama's election. So what if it was really past its sell-by date in 2015, never mind in 2017? Recycling is a good thing, isn't it?

Aside from the lack of color, there are two other differences in the maps that appear in the book. One: they don't have the identifier by year. Two: the more cautious label applied in the newspaper illustration, "Where racially charged Internet searches are most common," is replaced by the more confidently asserted "Racist Search Rate." If you suspect that they are, in fact, different maps, I can only tell you to open the book and look for yourself to be assured that I am not misrepresenting anything. This is an example of certainty camouflaging contingencies.

Here's my simple Venn diagram of the assertion Stephens-Davidowitz made in his live presentation, as he did in his book: that the biggest single predictor of a vote for Trump was being a racist. The two circles overlap almost completely.
My own illustration of Seth Stephens-Davidowitz's contention



But if you start looking at the data he used to justify this conclusion, you see it's not at all this simple.


It's true that Nate Cohn hopes to insinuate that Trump's support includes areas that tend to be more racist, though he is smart enough to qualify the argument: "That Mr. Trump's support is strong in similar areas does not prove that most or even many of his supporters are motivated by racial animus. But it is consistent with the possibility that at least some are."


The article also reflects an understanding that things are really not so black and white in Democratic vs. Republican presidential elections: "Many Democrats may now even identify as Republicans, or as independents who lean Republican, when asked by pollsters."


Remember the NY Times article's title? That's the main argument, not really the twist that Silver gave it, as many replies to his tweet pointed out: "Mr. Trump appears to hold his greatest strength among people like these — registered Democrats who identify as Republican leaners — with 43 percent of their support, according to the Civis data."

While the article merely suggests that racial attitudes could be involved, Stephens-Davidowitz goes even further than Nate Silver's tweet, asserting that racism is the strongest predictor of a Trump vote. That is what he said in the live presentation I heard on April 19th. At the end of the event, I went up to him and asked how it is possible to link a person who searched for things like racist jokes with a vote for Trump.

He admitted that would be impossible. Instead, he said, they look at the areas where Trump won and correlate them with areas where there have been searches he identifies as racist, drawing the conclusion that racism was the definitive motivating factor in votes for Trump.

He indicated that the correlations were based on the verified fact of which states voted for Trump, correlated with the types of Google searches that, he contends, identify a person as racist, and that was conclusive enough for him.

What he failed to admit was that the correlations were not made on the basis of actual voting results but on earlier maps of projected Trump support.

 A look at the actual map of the election results shows a different story. The predictions included just a fraction of the states that did go to Trump, which means they failed to represent the voters overall, a serious failing in what is presented as comprehensive and accurate data.


Let's take a closer look. The maps paired by Stephens-Davidowitz imply a correlation between his findings from search data that ended in 2007 and the Trump support assumed to be in place in 2015. So remember all the steps of remove we have here:

  1. We have search data for racist jokes, the n-word, and the like, on the basis of which we are to assume that all (or at least most) voters of a particular state can be characterized as racist if the percentages of such searches are higher than average.
  2. Furthermore, we must assume that over the course of nearly a decade, all the racists stayed in place and retained their views. While such an assumption of stasis may have worked a hundred years ago, it is doubtful that it holds in the 21st century, when things move at broadband speed.
  3. We have to consider that the overlap of higher support for Trump and higher racism rankings is treated not just as a correlation but as an indication of a causal relationship, as he explicitly identified it as an accurate predictor.
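The leap in these steps, from an area-level correlation to a claim about individual voters, is a textbook ecological fallacy, and a toy simulation can show why it fails. Everything below (the region count, the rates, the shared "regional factor") is invented for illustration and has nothing to do with Stephens-Davidowitz's actual data:

```python
import random

random.seed(42)

# Toy illustration of the ecological fallacy: two regional rates can be
# almost perfectly correlated even though the people behind one rate
# are only a small slice of the people behind the other.
# All numbers here are invented for illustration.
rates = []
for _ in range(50):
    factor = random.random()  # shared regional factor (e.g., rural vs. urban)
    search_rate = 0.05 + 0.10 * factor + random.gauss(0, 0.005)
    vote_share = 0.30 + 0.40 * factor + random.gauss(0, 0.02)
    rates.append((search_rate, vote_share))

# Pearson correlation of the two regional rates
n = len(rates)
mean_s = sum(s for s, v in rates) / n
mean_v = sum(v for s, v in rates) / n
cov = sum((s - mean_s) * (v - mean_v) for s, v in rates) / n
std_s = (sum((s - mean_s) ** 2 for s, v in rates) / n) ** 0.5
std_v = (sum((v - mean_v) ** 2 for s, v in rates) / n) ** 0.5
r = cov / (std_s * std_v)

print(f"area-level correlation: r = {r:.2f}")
print(f"highest regional search rate: {max(s for s, v in rates):.0%}")
```

In this sketch the search rate tops out around 15% of a region's population, yet it correlates almost perfectly with vote shares of 30-70%. A near-perfect area-level correlation therefore cannot, by itself, tell us that the searchers are the voters, let alone that they define them.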

The whole theory could possibly be woven together to appeal to those who already favor that outcome, but it doesn't hold water. Even if I granted Stephens-Davidowitz that most of the people who had conducted those internet searches about a decade before the election stayed in place and did cast their votes for Trump, the visualization would look like this:


Of course, that is not wholly accurate either, as we don't have a clearly established relationship between the people who made racially charged searches back when their search data was collected and the voting citizens of the area in 2016. But it represents the fact that even if some racists are included in the voting pool for Trump, they don't define all the voters in that pool. As we can see from juxtaposing the maps of actual voting results with Stephens-Davidowitz's own map of racism, Trump voters were not confined to those states. See the comparison below.

This reveals that the correlation that Stephens-Davidowitz points to is not nearly as causal as implied. First, some of the places he identified as leaning toward racism in the west actually voted for Clinton. Second, the states that did vote for Trump far exceed the ones identified as inclined toward racism. So the inclusion of some states with racist searches among the Trump wins is not a definitive correlation, because it covers only a portion (not a definitive majority) and fails to account for the voters overall.


Ultimately, what Stephens-Davidowitz's set of maps really shows is not conclusive proof that racism is the best predictor of votes. Instead, what we have in the argument is an illustration of a confirmation bias that clings to an outdated, misleading, and factually wrong representation even when we have access to data that disproves the theory.

While maps of actual votes were available by the time he published his book, he kept the map of incorrect predictions as the definitive map of Trump votes because it fit the hypothesis better. It just didn't fit reality: with only that limited support, Trump would not have won the election.

Sticking with old data sets because they are convenient -- both in terms of saving research time and in terms of fitting what you want to prove -- is not true data science, as it runs contrary to the essential values of science. Richard Feynman touched on this issue in a 1974 Caltech address entitled Cargo Cult Science, in which he explained that true science is about doing one's best "to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgment in one particular direction or another."


This is not to say that we must put Trump on a pedestal. No matter whether you love the president, hate him, or, like many others, are somewhat neutral and willing to judge based on results, you should still not distort data to support a particular narrative.


At this point, you may be thinking, "Well, that's all politics, but what about the sex in the title?" It's there because Stephens-Davidowitz uses sex-related examples to capture attention, as demonstrated by what he starts with in his book, in his interview (cited above), and in the live presentation I heard.

I didn't take on the deconstruction of that, though someone else did. See the second part of Chelsea Troy's review, Everybody Lies' Review Part 2: Dangerous Methodology. She brings up the issue of bad proxies and misleading numbers. I couldn't agree more with what she says there: "Just because there are some numbers floating around doesn't make a study valid."




