
Friday, April 27, 2018

Sex, Lies, and Data Profiles

The title of this blog post could have served as the title of Seth Stephens-Davidowitz's book Everybody Lies. As he explained in his interview with Freakonomics' Stephen Dubner, he knew the title was inaccurate, but he was told that "98% of People Lie" wouldn't sell well. In case you were wondering about the Cretan or Liar's Paradox implied by the title he opted for, he assures the interviewer that he is among the 2% of honest people. But didn't he then lie in the title? And is he really as honest as he claims? This post examines the second question.

Ultimately, what makes both data reports from people presented as experts and data visualizations so effective at conveying a point is that they don't require much analysis on the viewer's end: they've already done the thinking for you. That's both seductive and potentially misleading.

That's exactly why we have to be careful not to merely accept the visually expressed story at face value. Any data visualization should be subjected to a triple C test, a check for context, correlation, and causation, which I wrote about for Baseline here.

In light of how openly political both the media and the companies that handle data have grown, there's clearly a need for a few more C words to keep in mind when you are presented with what is offered as objective data:

  • Correspondence to reality. Just because someone claims expertise doesn't mean they are completely correct in their assertions. For example, when I was in labor with my first baby, the doctors and nurses at the hospital dismissed my pains, claiming the contractions were "mild" and that the birth was far from imminent. I was not the expert; they were, but I knew that I felt the baby coming. As it turned out, the resident barely got to me in time. I learned from that experience that you should not be gaslighted by expert views that directly contradict not just what you think you know but what you do know and directly experience.
  • Convenience: This pertains to both means and ends. Convenience of means refers to using the data that is on hand or easily measured even if it is not the most relevant data. It's rather like measuring how much snow fell on your windowsill because it's easy to reach rather than going out to measure on the street and in the drifts for a more accurate reading. Convenience of ends is about selecting data that easily fits the conclusion you wish to draw, AKA cherry-picking.
  • Confirmation Bias: In general, when you look for data on something, you have to bear in mind that absolute objectivity is rare. Many of us have deep-seated values and beliefs that will not allow us to entertain the possibility that we are on the wrong track, which skews our results through what we allow and disallow in the data set. It is the equivalent of painting a bull's eye around where your arrow landed. So ask yourself: does the person have a personal agenda that could be coloring the outcome? If so, treat them with the same healthy skepticism you would apply to tobacco studies sponsored by cigarette companies.
  • Certainty Camouflaging Contingencies: Few things are absolute, so if someone states something without qualifiers, something is likely being hidden or glossed over, like the fact that the data is out of date, or that searches for racist terms and jokes are taken as proxies for the searcher being a racist, with the labels then shifted from what actually was measured to what the person says the measurement signifies. This leads to a triple F: Fudging Figures and Facts.

Incidentally, Seth Stephens-Davidowitz takes no chances that you won't recognize him as an expert. Right on p. 1, he declares, "I am an internet data expert." I make no such claim, though I have been delving into questions of big data since 2011 and regularly review data science student work. But unlike Stephens-Davidowitz, I didn't work for Google. It was actually a team from Google that originally inspired me to write the piece on not believing everything you see in data visualizations.

The data visualization at the beginning of Everybody Lies (p. 13) presents two maps of the US that are intended to show a correlation that implies causation. Stephens-Davidowitz refers to having researched correlations between racist searches and voting patterns to argue that Obama lost votes to racism. However, the argument he makes about Trump right at the beginning of his book is actually based on an assertion that Nate Silver made in a tweet in early 2016.
Nate Silver's tweet cites an article by a different Nate, last name Cohn, to bolster his claim. So I went to his source: a New York Times article published on December 31, 2015, Donald Trump's Strongest Supporters: A Certain Kind of Democrat, and there are the maps that appear in Stephens-Davidowitz's book.




These are the maps juxtaposed to indicate correlation and imply causation in the article, and they reappear (in grayscale) in Everybody Lies. In case the caption is too small for you to read, I'll put it in text: "Source: Vote estimates by Congressional district provided by Civis Analytics; Google search estimates from 2004-7 by Seth Stephens-Davidowitz." How convenient! Stephens-Davidowitz already had that data set from when he gathered it to present evidence of racism at the time of Obama's election. So what if it was really past its sell-by date in 2015, never mind in 2017? Recycling is a good thing, isn't it?

Aside from the lack of color, there are two other differences in the maps that appear in the book. One: they lack the identifier by year. Two: the more cautious label applied in the newspaper illustration, "Where racially charged Internet searches are most common," is replaced by the more confidently asserted "Racist Search Rate." If you suspect that they are, in fact, different maps, I can only tell you to open the book and look for yourself to be assured that I am not misrepresenting anything. This is an example of certainty camouflaging contingencies.

Here's my simple Venn diagram of the assertion Stephens-Davidowitz made in the live presentation, as he did in his book, that the biggest single predictor of a vote for Trump was being a racist. The two circles overlap almost completely.
My own illustration of Seth Stephens-Davidowitz's contention
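For anyone who wants to sketch a diagram like this themselves, here is a minimal example using Python's third-party matplotlib_venn package; the subset sizes are invented purely to mimic the near-total overlap of the contention, not measured from any data:

```python
# A minimal sketch of the diagram above, using the third-party
# matplotlib_venn package (pip install matplotlib-venn).
# The subset sizes are invented to depict the *claim* of near-total
# overlap; they are not measured data.
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# (only-left, only-right, overlap)
venn2(subsets=(5, 5, 90),
      set_labels=("'Racist' searchers", "Trump voters"))
plt.title("The contention: almost complete overlap (illustrative)")
plt.show()
```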



But if you start looking at the data he used to justify this conclusion, you see it's not at all this simple.


It's true that Nate Cohn is hoping to insinuate that Trump support includes areas that tend to be more racist, though he is smart enough to qualify the argument: "That Mr. Trump's support is strong in similar areas does not prove that most or even many of his supporters are motivated by racial animus. But it is consistent with the possibility that at least some are."


The article also reflects an understanding that things are really not so black and white in Democrat vs. Republican presidential elections: "Many Democrats may now even identify as Republicans, or as independents who lean Republican, when asked by pollsters."


Remember the NY Times article title? That's the main argument, not really the twist that Silver gave it, as many replies to his tweet pointed out: "Mr. Trump appears to hold his greatest strength among people like these — registered Democrats who identify as Republican leaners — with 43 percent of their support, according to the Civis data."

While the article merely suggests that racial attitudes could be involved, Stephens-Davidowitz goes even further than Nate Silver's tweet, asserting that racism is the strongest indicator of a Trump vote. That is what he said in the live presentation I heard on April 19th. At the end of the event, I went up to him and asked how it is possible to link a person who searched for things like racist jokes with a vote for Trump.

He admitted that it would be impossible. Instead, he said, they look at the areas where Trump won and correlate them with the areas where there had been searches he identifies as racist, and from that they draw the conclusion that racism was the definitive motivating factor in votes for Trump.

He indicated that the correlations were based on the verified fact of which states voted for Trump, matched against the type of Google searches that, he contends, identify a person as racist, and that was conclusive enough for him.
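To make concrete what kind of calculation this is, here is a minimal sketch of an area-level (ecological) correlation; all the numbers are invented, and the point is that the unit of analysis is the area, never the individual searcher or voter:

```python
# Sketch of an area-level (ecological) correlation. All figures invented.
import numpy as np

# Hypothetical per-area rates: "racist" search share and Trump vote share
search_rate = np.array([0.08, 0.03, 0.12, 0.05, 0.09, 0.02])
trump_share = np.array([0.61, 0.44, 0.66, 0.49, 0.58, 0.41])

r = np.corrcoef(search_rate, trump_share)[0, 1]
print(f"Area-level correlation: r = {r:.2f}")

# Even a strong r here cannot show that the *same individuals* who
# searched also voted -- the classic ecological fallacy.
```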

What he failed to admit was that the correlations were not made on the basis of actual voting results but on earlier maps of projected Trump support.

A look at the actual map of the election results shows a different story. The predictions included just a fraction of the states that went to Trump, which means they failed to represent the voters overall, a serious failing in what is presented as comprehensive and accurate data.


Let's take a closer look. The maps paired by Stephens-Davidowitz imply a correlation between his search data, which ended in 2007, and the Trump support assumed to be in place in 2015. So remember all the steps of remove we have here (a back-of-the-envelope sketch follows the list):

  1. We have search data for racial jokes, the n-word, and the like, on the basis of which we are to assume that all (or at least most) voters of a particular state can be characterized as racist if the percentage of such searches is higher than average.
  2. Furthermore, we must assume that over the course of nearly a decade, all the racists stayed in place and retained their views. While such an assumption of stasis may have worked a hundred years ago, it is doubtful that it can hold in the 21st century, when things move at broadband speed.
  3. We have to accept that the overlap of higher support for Trump and higher racism rankings is not just a correlation but an indication of a causal relationship, as he explicitly identified it as an accurate predictor.
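Chained inferences like these compound. As a back-of-the-envelope illustration (the probabilities are invented; the arithmetic is the point), even three individually plausible steps can leave the overall chain weak:

```python
# Back-of-the-envelope only: the probabilities below are invented.
steps = {
    "searches really mark the searcher as racist": 0.8,
    "searchers stayed put and kept their views for ~10 years": 0.7,
    "the overlap is causal, not merely correlated": 0.6,
}

joint = 1.0
for claim, p in steps.items():
    joint *= p
    print(f"{p:.0%}  {claim}")

print(f"Joint plausibility of the whole chain: {joint:.0%}")  # about 34%
```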

The whole theory could possibly be woven together to appeal to those who already favor that outcome, but it doesn't hold water. Even if I were to grant Stephens-Davidowitz that most of the people who had conducted those internet searches about a decade before the election stayed in place and did cast their votes for Trump, the visualization would look like this:

Of course, that is not wholly accurate either, as we don't have a clearly established relationship between the people who made racially charged searches back when their search data was collected and the voting citizens of the area in 2016. But it represents the fact that even if some racists are included in the voting pool for Trump, they don't define all the voters in that pool. And comparing the maps of actual voting results with Stephens-Davidowitz's own map of racism shows that Trump voters were not confined to those states. See the juxtaposition below.

This reveals that the correlation Stephens-Davidowitz points to is not nearly as causative as implied. First, some of the places he identified as leaning toward racism in the west actually voted for Clinton. Second, the states that did vote for Trump far exceed the ones identified as inclined toward racism. So the inclusion of some states with racist searches among the Trump wins is not a definitive correlation: it covers only a portion (not even a majority) of his wins and fails to account for the voters overall.
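Simple set arithmetic makes the mismatch concrete. The state lists below are hypothetical stand-ins, but the structure mirrors the maps: partial overlap, with misses in both directions:

```python
# Hypothetical stand-in state sets, purely to show the structure.
flagged_by_search_map = {"WV", "MS", "LA", "KY", "OR", "PA"}
trump_2016_wins = {"WV", "MS", "LA", "KY", "PA",
                   "OH", "FL", "TX", "AZ", "IA"}

overlap = flagged_by_search_map & trump_2016_wins
unflagged_trump_states = trump_2016_wins - flagged_by_search_map
flagged_clinton_states = flagged_by_search_map - trump_2016_wins

print(f"Flagged states Trump won: {len(overlap)} of {len(trump_2016_wins)}")
print(f"Trump states the map never flagged: {sorted(unflagged_trump_states)}")
print(f"Flagged states Trump lost: {sorted(flagged_clinton_states)}")
```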


Ultimately, what Stephens-Davidowitz's set of maps really shows is not conclusive proof that racism is the best predictor of votes. Instead, the argument is an illustration of a confirmation bias that clings to an outdated, misleading, and factually wrong representation even when we have access to data that disproves the theory.

While maps of actual votes were available by the time he published his book, he kept the map of incorrect predictions as the definitive map of Trump votes because it fit the hypothesis better. It just didn't fit reality: with that limited support, Trump would not have won the election.

Sticking with old data sets because they are convenient, both in terms of saving research time and in terms of fitting what you want to prove, is not true data science, as it runs contrary to the essential values of science. Richard Feynman touched on this issue in a 1974 Caltech commencement address entitled Cargo Cult Science, in which he explained that true science is about doing one's best "to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgment in one particular direction or another."


This is not to say that we must put Trump on a pedestal. Whether you love the president, hate him, or, like many others, are somewhat neutral and willing to judge based on results, you should still not distort data to support a particular narrative.


At this point, you may be thinking, "Well, that's all politics, but what about the sex in the title?" It's there because Stephens-Davidowitz uses sex-related examples to capture attention, as demonstrated by what he opens with in his book, in his interview (cited above), and in the live presentation I heard.

I didn't take on the deconstruction of that material, but someone else did. See the second part of Chelsea Troy's review, Everybody Lies' Review Part 2: Dangerous Methodology. She brings up the issue of bad proxies and misleading numbers. I couldn't agree more with what she says here: "Just because there are some numbers floating around doesn't make a study valid."

Thursday, September 7, 2017

Missingness at the Museum

Pic credit: By Ingfbruno - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=29455109


A note on the title: Missingness is the term for missing data. I may come across it more than most people do because I've read a lot of blogs on data scraping projects. It occurred to me that the term really fits the context of this post.

The museum in question is the AMNH, which is world-famous for its dinosaur exhibits, though not so famous for its pricing structure. That obscurity is just one aspect of the missingness in place.

The lines to purchase tickets get fairly long at this museum, though this past Sunday was not nearly as bad as another summer Sunday, when the line extended outside and around the block. Perhaps the special exhibit on mummies is not such a great draw. (BTW, if you are interested in mummies or anything else Egyptian, you really have to take a trip to Brooklyn to see the exemplary collection of the vastly underrated Brooklyn Museum.)

I'll talk briefly about the problem with pricing information because it is related thematically to missingness, though it is not my main point. As I said, the lines to pay get very long despite the fact that there are various options to purchase tickets without waiting on that line. They include buying them online and buying them at the machines right next to the lines in front of the human cashiers.

On this particular trip, the family behind me on line made two attempts to purchase tickets via machine and then gave up and returned to the line. I noticed only one group that got off the line, purchased the tickets there, and then went straight in. Why is that? Wouldn't everyone want to cut out the wait time and go straight in?

There are a number of reasons why people persist in waiting for humans, but the primary one seems to be confusion. The entrance to the museum offers various ticket levels, from basic to plus-one to all-inclusive. The prices also vary by age and status: adult vs. student and child. But two additional factors complicate the selection even more. One is that some of the "specials," which include both temporary exhibits and films, call for timed entry. The other is that the basic admission price is really supposed to be "pay what you wish," just like at the Met. However, any time you add on a special, the full basic "suggested" price is rolled in.

You may be willing to forgo the specials to knock your basic admission price down from $23, but that's not an option when you pay at the machine. It will only accept full payments. It also will not issue you a ticket for showing your Bank of America card on the first weekend of the month. Yes, this museum is among those that participate in the Museums on Us program, but if you hadn't checked this out on your own, you'd have no way of knowing it from your visit in person. Consequently, it seems that people rarely take advantage of the program. In addition, due to the pricing structure in place, the museum does not allow visitors to count the Museums on Us entry as covering the basic cost with an add-on price just for the specials, something other museums do allow.
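To show why the machine feels like a trap, here is the ticket logic as I understand it from the signage, sketched in code; only the $23 suggested basic price comes from the museum, and the add-on price is an invented placeholder:

```python
# Ticket logic as I understand it; only the $23 suggested basic price
# comes from the museum's signage. The add-on price is a placeholder.
SUGGESTED_BASIC = 23.00

def machine_total(num_specials: int, addon_price: float = 5.00) -> float:
    """The machine always rolls in the full suggested basic price."""
    return SUGGESTED_BASIC + num_specials * addon_price

def cashier_total(num_specials: int, pay_what_you_wish: float,
                  addon_price: float = 5.00) -> float:
    """A human cashier can honor pay-what-you-wish on basic admission,
    but adding any special rolls in the full suggested price."""
    if num_specials > 0:
        return SUGGESTED_BASIC + num_specials * addon_price
    return pay_what_you_wish

print(machine_total(0))        # 23.0 -- no discount path at the machine
print(cashier_total(0, 5.00))  # 5.0  -- only a human can take this
```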

Given that most of the people in the line appeared prepared to pay the full suggested amounts, though, it becomes clear that they either don't realize the machines will complete their transactions faster or they want a person to provide information and guidance on the profusion of alternatives. This is a major flaw in informing the public ahead of time about how things work there in order to expedite entry.

Now to the main point of missingness, which some people fail to grasp altogether: the missingness in basic numbers that are accepted as the basis of data.

On this trip, we took a guided tour of museum highlights (though we've seen them all before). The tour included a stop in the Hall of Ocean Life, where the guide pointed out the blue whale (which you really can't miss) and spoke about how scientists estimate population numbers now versus how they did in the past. He explained that in the past, when whales were hunted, the numbers were a function of the number killed, with an extrapolation for how many must still be out there. Now that hunting whales is illegal, they use other methods to estimate the numbers, and on that basis they conclude that the population has diminished.

Now, I recall reading years ago about people who used a similar method to justify catching and killing wild mustangs. They figured that for each mustang they saw, there were several they didn't. At the time, that approach came under fire from those who considered it to favor the hunters by allowing them to overstate the numbers. If the same was true for whales, the numbers estimated in the past were likely overstated. Even if they were not, comparing that system of counting with a count based on completely different counting assumptions is the proverbial comparison of apples to oranges. In other words, you're mixing two completely different systems, each with its own missingness, to come up with conclusions about numbers, and that is both inconsistent and misleading.
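A toy sketch of the two counting systems shows why the comparison wobbles; every number below is invented, and the point is only that each method bakes in its own missingness assumption:

```python
# Toy sketch with invented numbers: two estimation systems, two different
# missingness assumptions, so their outputs are not directly comparable.

def harvest_based_estimate(killed: int, unseen_per_seen: float) -> float:
    """Old method: extrapolate from the kill count, assuming some number
    of unseen whales for every whale encountered."""
    return killed * (1 + unseen_per_seen)

def survey_based_estimate(counted: int, coverage: float) -> float:
    """Modern method: scale a sample count up by survey coverage."""
    return counted / coverage

past = harvest_based_estimate(killed=2_000, unseen_per_seen=4.0)  # 10,000
present = survey_based_estimate(counted=300, coverage=0.05)       #  6,000
print(past, present)

# A "decline" from 10,000 to 6,000 -- but if the old multiplier was
# inflated (2 unseen per seen instead of 4), the past population was
# 6,000 and there was no decline at all. The conclusion hinges on
# untestable assumptions about what was missing from each count.
```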


There is a great deal of guesswork in science. Certainly, the guide admitted this in showing the Titanosaur. Not only is it a 3D-printed replica rather than the actual fossil, but the replica's head is based on a completely different fossil because no head was found. We see things put together as one whole and assume that they are accurate. But that is often not the case, so we have to bear in mind that even visualizations that appear compelling may not reveal the whole story of the data. Missingness can be dealt with, but one has to know which approach was taken and whether that solution contributes to better understanding or pushes toward a particular outcome that is not truly objective. For true scientists, merely getting things to fit is not the answer. That's why you see reworkings of dinosaur exhibits every once in a while.

Related: http://writewaypro.blogspot.com/2016/10/data-visualization-you-have-to-c-it-to.html

Tuesday, January 17, 2017

Decisions, devices, data, and doctors: should you keep them away?

Some time ago, I read Eric Topol's The Patient Will See You Now: The Future of Medicine Is in Your Hands. I also read Robert Wachter's somewhat less bullish-on-technology The Digital Doctor: Hope, Hype, and Harm at the Dawn of Medicine's Computer Age. Now I've just completed H. Gilbert Welch's Less Medicine, More Health: 7 Assumptions That Drive Too Much Medical Care, with its even less sanguine view of the possibility of generating a lot more data on one's health.

It's good to read all three to get a sense of the developments in the brave new world of digital health (see Healthcare Analysis: Doctor vs. Device) and why it's not all good. When someone first told me about Welch's book, I envisioned something like this:

Dr. W: One of the best things you can do to improve your health is to engage in regular exercise. My father, for example, walked 2 miles to and from work each day.
Random person: That's great, how old is he?
Dr. W: He died at 60 from pneumonia he developed after becoming sick from colon cancer.
Random person: ??

That may be your initial reaction, but if you think about it, you realize his father's early death doesn't disprove his general guidelines for health, which also include the old wisdom of "everything in moderation and nothing in excess." Dr. Welch wasn't claiming that anyone who walks is guaranteed a long life; it is just one of the factors that contribute to good health. Cancer can happen to anyone, and that doesn't disprove the fact that walking is good for you any more than the smoker who lives to 100 proves that smoking is not bad for you. People have to remember that there are general rules and loads of exceptions. Dr. W. bets on the rules and on what you can do for your health without taking extreme action or obsessing over every bit of health data you can access.

He certainly offers a contrast to Topol's celebration of technology giving patients increasing access to their health data. For example, Topol was thrilled with the fast blood lab analysis offered by Theranos, which has, since the book's publication, fallen very much out of favor with the public and the law. Topol also considered the Angelina Jolie effect a very good thing, a sign of women taking charge of their health. While Welch doesn't say the star was wrong for her own situation, he argues that that kind of testing and pre-emptive surgery doesn't make sense for most people.

Welch devotes a great deal of his book to the downside of too much data, not just because of the irrelevant noise, but also because the information it provides can prove more harmful than helpful, raising anxiety levels and prompting invasive actions that don't really improve one's health or wellbeing. This is particularly the case with breast cancer, where "early detection saves lives" has been used to push yearly mammograms on the entire female population, screenings that often raise alarms, prompt biopsies, and sometimes lead to removal of what would never have spread to pose a real threat.

Some of these issues have already been explored in books like Pink Ribbon Blues: How Breast Cancer Culture Undermines Women's Health (Oxford University Press, 2011). They have also gotten those in the know to change the recommendations for women's mammograms. Nevertheless, the most recent government guidelines for women's health still push that outdated information, allowing for regular mammograms for all women 40 and up and stating unabashedly, "The good news is that mammograms can help find breast cancer early. Most women can survive breast cancer if it's found and treated early." This dangles a false promise of saving lives that often were not in danger at all and completely ignores the harm that can result, something more and more experts are admitting as studies like the one covered by PBS, "One in three women may receive unnecessary mammograms, study says," come to light.

What's true of screening for breast cancer is also true of other forms of screening that lead to invasive tests and treatments in the attempt to "fix" problems that would cause no ill effects if left alone. But even when a screening doesn't entail harm, Welch says, we should ask if it does actual good. This is important to know because the right to say "no" to a suggested test, on the grounds that the information will not be actionable in any case, is empowering for patients and their caregivers.

Here's a case in point: a couple of years ago, I brought my son to a doctor when he had signs of a cold, just to be sure it wasn't strep or something else that would require medication. The doctor decided to also test him for flu. Though both rapid tests were negative, he wanted to be sure and put in for overnight lab tests for both. They, too, were negative. Now here's the thing: it may have made sense to do the strep lab in case the rapid test was inaccurate, because someone with strep should take antibiotics, but the extended flu test made no sense at all, because the results take days, and by then 1) it's too late to take Tamiflu or any other prescribed medication to mitigate symptoms and 2) you'd know whether you had the flu yourself at that point based on the extent of your suffering. So the doctor had put in for a test that cost over $100 (not covered by insurance because, after the ACA went into effect, a deductible was added for all diagnostic labs) with no tangible benefit for the patient. The only ones who stood to benefit from the lab data are the people in NY state who collect data on flu. But they were not the ones given the bill.

It's very hard for some of us to resist doctors' recommendations for tests, treatments, etc. That's because we have to break through our own biases that convince us the doctors know what they're doing and are always acting in our best interests. That's not to say that doctors are completely ignorant or that they are deliberately jacking up their incomes with more procedures (though some are, or order them to cover themselves in case of lawsuits), but they are conditioned to automatically run these tests and make the standard recommendations in a one-size-fits-all approach to medicine. It's up to individuals to get informed and empowered.

Wednesday, February 27, 2013

Dating Homer


From Geneticists Estimate Publication Date of the "Iliad."
Of course, publication is not exactly the term one would use for an oral work, which, as the research shows, seems to have grown out of various other oral traditions that go back another 500 years or so before the "publication" date. Still, the language itself served as the bread crumbs marking the trail back to the point when the compilation of stories known as the Iliad became set in the form that has been passed down through the generations.
  
"Languages behave just extraordinarily like genes," Pagel said. "It is directly analogous. We tried to document the regularities in linguistic evolution and study Homer's vocabulary as a way of seeing if language evolves the way we think it does. If so, then we should be able to find a date for Homer."

The date they arrived at was 763 BCE, give or take 50 years.

The researchers employed a linguistic tool called the Swadesh word list, put together in the 1940s and 1950s by the American linguist Morris Swadesh. The list contains approximately 200 concepts that have words in apparently every language and every culture, Pagel said; these are usually words for body parts, colors, and necessary relationships like "father" and "mother." They looked for Swadesh words in the "Iliad" and found 173 of them. Then they measured how those words changed.
They took the language of the Hittites, a people who existed around the time the war may have been fought, and modern Greek, and traced the changes in the words from Hittite to Homeric to modern. It is precisely how they measure the genetic history of humans: going back and seeing how and when genes alter over time.
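The researchers' actual model is more sophisticated, but classic glottochronology, the technique Swadesh's list was built for, gives the flavor: assume core vocabulary decays at a roughly constant rate and read the elapsed time off the fraction of words retained. A rough sketch, where the 86% retention rate is Swadesh's textbook figure and the retained fraction is an invented example:

```python
# Classic glottochronology sketch (not the paper's actual model).
import math

# Swadesh's textbook figure: ~86% of the core word list survives
# per 1,000 years along a single lineage.
RETENTION_PER_MILLENNIUM = 0.86

def years_elapsed(fraction_retained: float) -> float:
    """Invert c = r**(t/1000) to recover t in years."""
    return 1000 * math.log(fraction_retained) / math.log(RETENTION_PER_MILLENNIUM)

# Invented example: if Modern Greek kept ~66% of Homer's Swadesh words,
# the elapsed time comes out near 2,750 years -- the 8th century BCE.
print(round(years_elapsed(0.66)))
```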


Wednesday, September 19, 2012

Representing 100 years of childhood


At the rate of 2.5 quintillion bytes of data a day, we have created 90 percent of the data in existence in just the past two years. And while the remaining 10 percent sounds small in comparison, working with the data of the past presents the same challenges as any Big Data project. You have to consider what to include and what to exclude to come up with the questions, correlations, and contexts that relate to your concerns. They are key to the representation of your data, whether in the form of a report, an infographic, or a physical exhibition.
Some of those essential components were missing in the Century of the Child: Growing by Design, 1900–2000 exhibit at the Museum of Modern Art. It's described as "the first large-scale overview of the modernist preoccupation with children and childhood as a paradigm for progressive design thinking. The exhibition will bring together areas underrepresented in design history and often considered separately, including school architecture, clothing, playgrounds, toys and games, children's hospitals and safety equipment, nurseries, furniture, and books."
Granted, it is impossible to show everything. Yet I question the omission of an American Girl doll. (The company that produced the line was sold to Mattel in 1998 for $700 million.) The line was introduced in 1986 and was considered a significant departure from the Barbie style that dominated the doll market at the time. These dolls represented girls rather than full-figured adults and offered some historical insight through their accompanying books. When they first came out, the $100 dolls also raised questions about how much parents are expected to spend on toys, something worth bringing up in relation to consumerism.
Of course, people get a nostalgic kick out of seeing the toys and furniture they associate with their own childhoods, like classic wooden and Lego blocks, an Erector set, an Etch-a-Sketch, a Rubik's cube, a Slinky, and a Barbie house. Still, the toys should have offered more than a trip down memory lane. While the exhibit points to the rather obvious cause of the proliferation of space-age toys, it does not explore how other toys were also products of their times.
Aside from what different types of toys represent, there is the evolution within toy lines to consider. For example, Lincoln Logs started incorporating plastic and premade windows into its sets. Tinker Toys evolved from simple wooden forms to plastic ones that included specialized pieces, with sets in pastel colors marketed to girls. These modifications raise questions about materials, imaginative play, and gender that should be considered in such an exhibit.
Without context and explanations, you just have random items that do not signify meaning. As Louis Agassiz said, "Facts are stupid things until brought in connection with some general law." The same holds true for data, no matter how big.