A case study in messy data analysis: the Australian same-sex marriage survey

Last month the Australian people signaled their approval of legalizing same-sex marriage by a 62%:38% margin in a national survey. (On a personal note, I was elated and relieved by the result: my husband and I have discussed eventually retiring to Australia, and with this decision our marriage would be recognized there.) While fears of a surprise Brexit-like electoral backlash proved unfounded, researchers including R user Miles McBain explored the results for correlations to demographic variables. This process wasn't as simple as it might have been though: the Australian Bureau of Statistics released the results as a pair of Excel files that violate just about every good practice for sharing data in spreadsheets:


Miles shares the R code he used to extract useful data from this spreadsheet as a blog post that makes a great case study in dealing with messy data using R. The post demonstrates how he used the read_excel function (from readxl package) to extract specific sub-tables from the spreadsheet by specifying row and column ranges, and then use the dplyr package to clean up and merge the data. If you want to explore the data yourself, you can find the R code and the source data in this Github repository.

In a follow-up post, Miles combines the same-sex marriage survey data with Australian Census data to explore various demographic relationships. Unlike the US Census data (which is easily accessible in R thanks to the tidycensus package), there's no interface package for Australian Census data. (Selected tables are available in the Census2016 package, however.) Instead, Miles demonstrates how to use R to download and extract data from the the "Census DataPacks" (CSV data files and Excel data dictionaries) provided by the Australian Bureau of Statistics.  Yet more data wrangling allows Miles to create summary charts of the responses, such as this chart of proportion voting No by percent of the district population declaring a religious affiliation, broken down by state. As you may expect, those districts with more religious populations voted No at greater rates.


Both of these post provide great examples of working with government data, which is often provided in inconvenient formats with messy structures. Follow the links below for step-by-step guides, including the R code used to extract the data, structure it for analysis, and create useful charts.

Medium (Miles McBain): Tidying the Australian Same Sex Marriage Postal Survey Data with R; Combining Australian Census data with the Same Sex Marriage Postal Survey in R