Data Viz Revision: O’Reilly 2016 Data Science Salary Survey (Part 3)

This post is part of a series based on the data displayed in O’Reilly’s 2016 Data Science Salary Survey. Using the Data Chefs Revision Organizer as a guide, we will rethink and revise some of the visualizations featured in the report.

In this visualization, the authors are trying to show the share of survey respondents located in each region of the world:


The blue circles do not depict the underlying data in this map, as they did in the visualizations from the first two posts in this series.  Instead, the blue bubbles here are merely a stylistic choice: they serve as pixels representing the world’s land mass. The numeric values are then laid on top of their corresponding regions.

It’s important to note that while all the categories are regional, the units vary. Sometimes they refer to countries (e.g., the United States, Canada), sometimes to entire continents (e.g., Africa, Asia), and sometimes to vague regional groupings (e.g., Latin America). Given the inconsistency in the data categories, it’s not surprising that the visualization is a little unclear too.

One of the problems with this visualization is that the values are shown only as numbers, so the reader does not immediately notice how much they differ in size. If you move back a little bit or squint your eyes until you can’t quite read the exact values, there’s nothing that immediately distinguishes the highest value (United States) from the lowest (Africa). Both appear to be white text that takes up roughly the same amount of space on a blue grid.

As I considered how to revise this map, my first thought was to try to salvage the blue bubble theme by using blue bubbles sized based on the values and placed over a geographic map.  Here’s a mockup I did using carto:


And here’s one I did using PowerBI:


While you can immediately see the size difference in values on these revisions, this type of map still has the same issue as the original, namely, confusion caused by inconsistent geographic categories. What countries constitute “Latin America,” for instance? If we assume that a number of the Caribbean island nations are part of Latin America, then it seems a little weird that the value is placed in the middle of South America. To take another example, respondents from Iceland probably fall under Europe/non-UK, but there’s a disconnect (literally), because the value bubble is all the way in mainland Europe.

There’s also a secondary problem that arises from the limitations of the tools I used: PowerBI and carto. If you look at my examples, the bubbles are not sized consistently. In both tools, it’s difficult to make bubble maps in which the size of each circle accurately reflects area, not diameter. For these reasons, I ruled out the bubble map.
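To make the area-versus-diameter point concrete, here is a minimal Python sketch. It is not how PowerBI or carto size their bubbles; the function names and the numbers are made up for illustration. The idea is that if a circle’s area should be proportional to its value, the radius has to scale with the square root of the value.

```python
import math


def area_scaled_radius(value, max_value, max_radius):
    """Radius whose circle AREA is proportional to value."""
    if max_value <= 0 or value < 0:
        raise ValueError("expected value >= 0 and max_value > 0")
    return max_radius * math.sqrt(value / max_value)


def diameter_scaled_radius(value, max_value, max_radius):
    """Naive scaling: radius (and so diameter) proportional to value."""
    if max_value <= 0 or value < 0:
        raise ValueError("expected value >= 0 and max_value > 0")
    return max_radius * value / max_value


# Hypothetical values (not survey results): one value is a quarter of the other.
big = area_scaled_radius(100, 100, 40)
small = area_scaled_radius(25, 100, 40)
naive_small = diameter_scaled_radius(25, 100, 40)

# Area goes with radius squared, so compare squared radii.
print((small / big) ** 2)        # share of the big bubble's area, area scaling
print((naive_small / big) ** 2)  # share of the big bubble's area, naive scaling
```

With diameter scaling, a value that is one quarter of the maximum gets a bubble with only one sixteenth of the area, which is why the smaller bubbles look far smaller than they should.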

Next, I considered a part/whole visualization, like the ones in part 2. But because there are eight distinct categories, and some of the values are relatively small, I knew it would be hard to see the smaller values and their labels.

So, ultimately, I settled on this revision:





It’s just a simple bar chart, with values ranked from highest to lowest. The benefit of using this simple graph, rather than the map, is that it eliminates the confusion caused by the inconsistent units of the regional categories. Because the chart doesn’t show individual countries, we no longer have to wonder which region each one falls under.
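For readers who want to try the ranked-bar idea without a charting tool, here is a rough Python sketch that sorts categories from highest to lowest and draws text bars. The category names and numbers are placeholders, not the survey’s figures, and `ranked_bars` is a name invented for this example.

```python
def ranked_bars(values, width=40):
    """Return text lines for a bar chart, sorted from highest to lowest."""
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
    top = ranked[0][1] if ranked else 0
    lines = []
    for label, value in ranked:
        # Bar length is proportional to the value; the largest bar fills `width`.
        length = round(width * value / top) if top > 0 else 0
        lines.append(f"{label:<10} {'#' * length} {value}")
    return lines


# Placeholder data, NOT the survey results.
sample = {"Region A": 5, "Region B": 60, "Region C": 20}
for line in ranked_bars(sample):
    print(line)
```

Sorting before plotting is the step that does most of the work here: the reader can see the ranking at a glance without comparing labels.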

This may not be as visually appealing as the original, but, sometimes, the simplest solution is the best solution.

On Using “Racial” Color Categories in Data Visualization

ethnicity americas.png


Today I came across this map depicting the ethnic composition in the Americas (h/t Randy Olson).

It rubbed me the wrong way immediately, and I’m not talking about its use of multiple pie charts. It brought to mind these examples from Stephanie Evergreen (via Vidhya Shanker).

The first issue is that the categories themselves seem arbitrarily defined, and there is a conflation of ethnicity and race. For instance, “Mestizo,” “Mulatto,” “Garifuna,” and “Zambo” are all multiracial or multiethnic groups, yet, there is also an “Other, Multiracial, Mixed” category.

ethnic categories.png

Complicating matters further, as a result of legal classifications regulating those of African descent, in some places (e.g. the United States), a number of those categorized as “Black” have been of multiracial descent.  Moreover, the title refers specifically to “Ethnic Composition,” but “Black” and “White” are technically not ethnicities.

This brings me to the terms themselves. While the Spanish term “mulato” may still be acceptable, the English “mulatto” is definitely no longer considered appropriate (and is often considered a racial slur), yet there it is, representing people in the U.S. and Canada.

Then, there is the most glaring problem, in my view: the use of symbolic racial category colors to represent the different groups.  I’m sure the thought behind it was to use immediately recognizable colors to limit confusion, but in data visualization, a good practice is asking if the benefits of familiarity outweigh the costs.

What are some of those costs? Specifically, using one stylized, but supposedly realistic “racial” color to represent each group brings us back to the earlier point about conflating race and ethnicity.  It also takes groups that contain people with a wide range of skin tones and represents each one using only one shade. This is not a problem when the colors are abstract (see this racial dot map featured below–no African Americans actually have green skin, for instance). But when it’s supposed to represent real people, it feels both reductive and exclusive.

racial dot map.png

And what are we supposed to make of the fact that most of the racial colors are “realistic,” except for the bright red of the “Native American” category and the yellow of the “East Asian, East Indian, Javanese” category?

There has also been some pushback against this kind of “familiar” color categorization with respect to sex and gender.

Bottom line: data visualization is all about deliberate choices and tradeoffs. When confronted with “sensitive” data, it’s a good idea to ask yourself, “Could the choices I’ve made offend people?”

Let’s say you have an aversion to this kind of framing: to “offense” as a legitimate constraint.  That’s fine. In that case, I’d suggest you modify the question to “Could the presentation & classification choices I’ve made distract from the content?”


Data Viz Revision: NIMSP’s Campaign Contributions Disclosure Scorecard


Earlier this year, the National Institute on Money in State Politics (NIMSP) created a brilliant campaign contributions disclosure scorecard depicting each US state’s grade.

They used Tableau to visualize the data as a geographic state choropleth and labeled each state with its scorecard grade (A through F):

Sunlight Scorecard


There are a few problems with visualizing this data in this way.  First, there’s no way to see Washington, DC on this map. Also, on this map, the grade labels are redundant because the legend already shows what color represents each grade. But the biggest problem is that using a geographic map creates scale issues: you have to scroll and/or zoom to see the small states in the Northeast (e.g., Delaware, Rhode Island), not to mention Alaska and Hawaii.

The scorecard map is showing which states got which overall grade, so it doesn’t matter how big each state is.  For the purposes of this data, the states have equal weight. To try to solve this scale issue, I decided to revise the map as an abstract state map, with every state depicted using a figure of the same size and shape (I was inspired by this post by Danny DeBelius from NPR’s Visuals Team).

I decided to make a square tile map in Excel, applying conditional formatting on the state squares to get the different colors.  I downloaded the data from NIMSP’s Tableau file, then used this awesome tutorial by Caitlyn Dempsey at GIS Lounge. Here’s what my Excel square tile map looks like:

NIMSP Excel Scorecard
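The logic behind a square tile map like this can be sketched in a few lines of Python. This is not the Excel workbook or the tutorial’s method, just an illustration of the idea: every tile gets a (row, column) cell of equal size, and the grade decides the fill color, much like conditional formatting. The tile IDs, positions, grades, and color codes below are all placeholders, not NIMSP’s data.

```python
# Hypothetical grade-to-color lookup (plays the role of conditional formatting).
GRADE_COLORS = {
    "A": "#1a9641",
    "B": "#a6d96a",
    "C": "#ffffbf",
    "D": "#fdae61",
    "F": "#d7191c",
}


def build_tile_grid(positions, grades):
    """Place each tile in its (row, col) cell with its grade and color."""
    n_rows = max(row for row, _ in positions.values()) + 1
    n_cols = max(col for _, col in positions.values()) + 1
    grid = [[None] * n_cols for _ in range(n_rows)]
    for tile, (row, col) in positions.items():
        if grid[row][col] is not None:
            raise ValueError(f"two tiles share cell ({row}, {col})")
        grade = grades[tile]
        grid[row][col] = (tile, grade, GRADE_COLORS[grade])
    return grid


# Placeholder tiles standing in for states; positions and grades are made up.
positions = {"T1": (0, 0), "T2": (1, 0), "T3": (1, 1), "T4": (2, 1)}
grades = {"T1": "A", "T2": "C", "T3": "F", "T4": "B"}

for row in build_tile_grid(positions, grades):
    print(" ".join(f"{cell[0]}:{cell[1]}" if cell else "  . " for cell in row))
```

Because every tile occupies exactly one cell, small states and large states get the same visual weight, which is the whole point of the revision.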


I think the legend is a bit too big. It’s not that easy to make the cells smaller: you’d probably have to make all of the cells tiny, then merge them to get the bigger state squares.

Overall, the Excel tile map is fine, but I think it lacks a bit of visual pop.

Next, I decided to try to make the same tile map in Tableau, only with hexagons instead of squares. I followed Matt Chambers’ user-friendly tutorial at Sir-Viz-a-Lot and made the following map:

NIMSP Scorecard Hex Tableau.jpg
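If you are curious how hexagon tiles fit together, here is a small Python sketch of one common layout. It is an assumption for illustration, not necessarily what the tutorial or Tableau does: pointy-top hexagons on a row/column grid, with every odd row shifted right by half a hexagon so the tiles interlock. The function name `hex_center` is made up.

```python
import math


def hex_center(row, col, size=1.0):
    """Center (x, y) of a pointy-top hexagon; odd rows shift right by half a tile.

    `size` is the distance from the hexagon's center to a corner.
    """
    width = math.sqrt(3) * size            # horizontal distance between neighbors
    x = width * (col + 0.5 * (row % 2))    # half-tile shift on odd rows
    y = 1.5 * size * row                   # rows overlap by a quarter of the height
    return x, y


# Centers for a tiny 2 x 3 grid of tiles.
for row in range(2):
    print([tuple(round(v, 2) for v in hex_center(row, col)) for col in range(3)])
```

The half-tile shift is what separates the hexagon layout from the square one: each tile ends up the same distance from its neighbors in the rows above and below as from its neighbors on either side.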

Just for kicks, I switched the tiles back to squares instead of hexagons, and got what looks like a keyboard.  With this tile configuration, I think the hexagons look much better.

NIMSP Scorecard Square Tableau.jpg

Finally, I incorporated a few ideas from Keir Clark of Maps Mania, Andrew W Hill from the CartoDB Team, and the Danny DeBelius NPR post I mentioned earlier. Here’s the result:

NIMSP Scorecard Hex CartoDB2.png

So, what do you think?  Which revision do you think is most effective?