Using Canada's census data to map and analyze the overlaps and divergences between ethnic and national populations from the Horn of Africa in Canada. Somalis, Ethiopians, Oromos. Mapping in R, data analysis in R, migration statistics.
This script accompanies a journal article. If using the data or procedure, please cite the original article in which results were reported: D. K. Thompson (2018). Ethnicity and Nationality among Ethiopians in Canada's Census Data: a consideration of overlapping and divergent identities. Comparative Migration Studies, 6/6.
Introduction
This description outlines the procedure I followed to visualize and analyze a segment of Canada’s census data relating to ethnicity and birthplace for populations tracing their origins to the Horn of Africa. As part of my job is teaching social scientists how to use R software to generate such visualizations and analyses, this is meant to be a teaching script that requires a little of the reader’s own work and coding rather than simply a copy/paste and run approach to replicating the study. I give code examples of every procedure used in the study, but I do not necessarily repeat the examples for every year of data, leaving it instead to the reader to replicate some of that code based on the examples. I hope that readers who are less familiar with data management and analysis in R will take advantage of these examples if they are pursuing their own similar projects. Besides intending this document as a teaching document, however, I also believe that replicability is key to good science and I hope someone out there will test my conclusions and improve on my analysis.
A warning: My training is in field anthropology and ethnographic methods, not computer science. Most of my research uses a qualitative approach. This to say, please excuse my code where it is inelegant (which it may be throughout its entirety).
Part 1: Downloading and organizing Canada’s census data for analysis in R
Tables listed were downloaded from Canada’s census website as Beyond 20/20 files, which require users to download Beyond 20/20 software. This format enables user interface with multidimensional data, parts of which can then be selected and output as two-dimensional spreadsheets for analysis.
Example of translating data from Beyond 20/20 to .csv: Select the row with the data you want more details on.
From the grey list of variables in the bar above the spreadsheet, you can drag and drop one into the rows and one into the columns of the spreadsheet. So, for example, after highlighting Ethiopia, if I drag the grey “Immigrant status” bar down to the columns (where the screenshot above shows citizenship), and the “Geography” bar down to the rows (Place of Birth in the screenshot above), then we come with something like this:
a<-read.csv("Canada_1996.csv", stringsAsFactors = FALSE)
a<-a[c(175,177,202,214,615:617,690:692,765:767),]
a$Profile.of.CMA. = as.character(a$Profile.of.CMA.)
a[2,1]<-"Somalia_imm"
library(reshape2)
a<-melt(a) #This makes the columns into rows in a new column called "variable",
#and places cell values in a new column called "value."
colnames(a)[1]<-"1996"
a<-reshape(a, direction="wide", idvar = "variable",
timevar="1996") #This makes the values in the first column, renamed 1996, into new variables (columns)
colnames(a) = c("Geography", "Kenya_bn_1996", "Somalia_bn_1996",
"Somaliabn_recimm_1996", "Ethiopiabn_recimm_1996",
"Somali_total_1996", "Somali_single_1996", "Somali_mult_1996",
"Ethiopian_total_1996", "Ethiopian_single_1996", "Ethiopian_mult_1996",
"Eritrean_total_1996", "Eritrean_single_1996", "Eritrean_mult_1996")
a$Geography<-as.character(a$Geography)
Then I reorganize some columns, using the gsub() function to separate out some unique identifiers from the geography column. To see details of gsub commands, click HERE.
a$Geoid<-gsub("[A-z]", "", a$Geography)#Create a unique "Geoid" field for each geographical area
a$Subgeo<-gsub("[0-9]{8}", "", a$Geoid)
a$Subgeo<-gsub("\.", "", a$Subgeo)
a$Geoid<-gsub("$", "B", a$Geoid) #Place a random marker at the end of sequence I want to delete
a$Geoid<-gsub("\.[0-9]{5}B", "", a$Geoid) #Delete every sequence of 5 numbers followed by my marker
a$Geoid<-gsub("[^0-9]", "", a$Geoid) #Get rid of everything in Geoid column except numbers
a$Subgeo<-gsub("[^0-9]", "",a$Subgeo)
a$Geotype<-gsub(".*00000", "primary", a$Geoid)#Distinguish primary CMAs from their subdivisions (relatively few subdivisions in this set)
a = a[,c(1,15:17,2:14)] #Reorder columns to get ID variables up front
Within this data there are some duplicates since there are subdivisions within some of the census geographies. These are given by the geographical codes, so I identified the primary census geographies as “primary” using the geography code variable I created. Then I subset the data and calculate the total of each column. This will give the total recorded number of each population group in Canada as a whole. I append the column sums from b back into a, so I can keep the subdivision data but also have a row with the summed data for all of Canada. I then write a CSV file with the rearranged data.
b = a[a$Geotype=="primary",]
b["Total",] = c("Total", NA, NA, NA, colSums(b[,5:17]))
a["Total",] = b["Total",] #So as not to duplicate data from CMA sub-divisions
write.csv(a, "Canada1996_CMAs.csv")
Since this is meant to be a teaching script, I’m omitting the specific script to reorganize the data for 2001-2016. You can use the script for 1996 above as a model and make minor adjustments since the data is often a slight bit different year to year. Once I’m done, I write a CSV for each year.
Part 2: Immigration charts
Before getting into the data analysis, the first thing I want to show is how to generate charts and maps. Figure 1 in the article was created in ArcGIS, not R, and is basic, so let’s start with figures two and three. For these, the 2006 and 2016 data include a breakdown of migration by country of origin for five-year intervals. The two years have some overlap, but importantly they do not include all of the same 5-year intervals: 2006 includes 5-year estimates for 1991-1995 and 1996-2000, while 2016 includes 5-year estimates only for 2001-2005, 2006-2010, and 2011-2016 (yes, I know, 6 years…).
So, assuming we’ve put together all the tables from the different immigrant groups we want, which we exported from Beyond 20/20, we’ll have a spreadsheet with rows as geographical units (CMAs) and columns as populations of different immigrant groups. We want to select only Canada and provinces, so we’ll take all rows for which the id is less than 100. In the column names, I have specified the country and the years. We melt the data into long format, and then draw out separate columns for the country, the begin date of the count, and the end date of the count. Rather than “before 1991” that the data includes, let’s just assign a beginning date of 1986 to this data. This will come in handy when we want to select only 5-year estimates to plot. For my 2006 data, I originally created abbreviated country names, so we’ll need to adjust this with some gsub() work.
a<-read.csv("I://R/US_Census/Canada_data/Canadamaps/Canada_reviseddata2006.csv")
a = a[,-c(1)]
b = a[a$id<100,]
library(reshape2)
b = melt(b, id.vars= c("Geography", "id"))
library(stringr)
b$begin = str_replace(b$variable, "_", "!!")
b$end = gsub(".*\\_", "", b$begin)
b$end = gsub("[^0-9]", "", b$end)
b$begin = gsub(".*\\!\\!", "", b$begin)
b$begin = gsub("\\_.*", "", b$begin)
b$origin = gsub("\\_.*", "", b$variable)
b$origin = gsub("Af", "Africa", b$origin)
b$origin = gsub("EA", "Eastern Africa", b$origin, ignore.case=FALSE)
b$origin = gsub("Dj", "Djibouti", b$origin)
b$origin = gsub("Et", "Ethiopia", b$origin)
b$origin = gsub("Er", "Eritrea", b$origin)
b$origin = gsub("Ke", "Kenya", b$origin)
b$origin = gsub("So", "Somalia", b$origin)
b$end = as.numeric(b$end)
b = b[!is.na(b$end),]
b$begin = gsub("bf1991", "1986", b$begin)
b$begin = as.numeric(b$begin)
Now we want to exclude 10-year estimates, so based on the begin dates and end dates we can calculate a span and then select dates that exclude 10-year estimates. In order to put the data together with 2016 data, I also create a “Geolabel” column that is a shortened version of the “Geography” column, excluding all numbers. I then rename the data so I can just copy/paste the same procedure for 2016. I’ll let you try your hand at that.
b$span = b$end-b$begin
b$Geolabel = gsub("\(.*", "", b$Geography)
b$Geolabel = gsub("\s$", "", b$Geolabel)
c = b[b$span<8,]
c2006 = c #Rename so that we can use the same code for
#2016 data and then put them together.
Now getting the 2016 data in the same format. Something like this should do:
a<-read.csv("Canada_2016_CDs.csv", stringsAsFactors = FALSE)
a = a[,-c(1)]
a$Geoid = as.numeric(a$Geoid)
a = a[,c(1,2,28:82)]
b = a[a$Geoid<100,]
library(reshape2)
b = melt(b, id.vars= c("Geography", "Geoid"))
library(stringr)
b$begin = str_replace(b$variable, "_", "!!")
b$end = gsub(".*\\_", "", b$begin)
b$end = gsub("[^0-9]", "", b$end)
b$begin = gsub(".*\\!\\!", "", b$begin)
b$begin = str_replace(b$begin, "_", "!!")
b$begin = gsub(".*\\!\\!", "", b$begin)
b$begin = gsub("\\_.*", "", b$begin)
b$origin = gsub("\\_.*", "", b$variable)
b$end = as.numeric(b$end)
b = b[!is.na(b$end),]
b$begin = gsub("bf1981", "1976", b$begin)
b$begin = as.numeric(b$begin)
b$span = b$end-b$begin
b$Geolabel = gsub("\\(.*", "", b$Geography)
b$Geolabel = gsub("\\s$", "", b$Geolabel)
c = b[b$span<8,]
c2016 = c
colnames(c2006)[2] = "Geoid"
c2006 = c2006[c2006$end<2006,]
unique(c2006$end)
Once we have the 2006 and 2016 data in the same format and excluding 10-year estimates, we can just bind the rows together using the rbind() function. In order to do this, make sure that columns match for each data frame.
Although it’s not included in the article, since we kept Canada’s provinces, we could look at immigration of various groups to the provinces.
c = rbind(c2006, c2016)
library(ggplot2)
gg2 = ggplot(c[c$origin=="Somalia",],
aes(x=end, y=value, group=Geolabel))+
geom_line(aes(color=Geolabel))
gg2
gg3 = ggplot(c[c$origin=="Ethiopia",],
aes(x=end, y=value, group=Geolabel))+
geom_line(aes(color=Geolabel))
gg3
But what I mainly want is to look at a select set of groups in relation to Canada as a whole, so we can subset to exclude Africa and Eastern Africa, and subset to include only Canada. When putting together this data, there is an additional complication. The 2016 data includes pre-1980 groups, whereas these are included in the 1991 estimate in the 2006 dataset. Since the 2016 data does not include 5-year estimates for 1991, let’s take out the earlier data and just start with the 5-year estimates for 1991.
d = c[c$Geolabel=="Canada"|c$Geolabel=="Ontario",]
d = d[!d$origin=="Africa",]
d = d[!d$origin=="Eastern Africa",]
e = d[d$Geolabel=="Canada",]
e = e[!is.na(e$Geography),]
e = e[e$end>=1991,]
And here’s a little ggplot code to get you Figures 2 and 3, first showing the actual number of immigrant arrivals and then showing a cumulative plot of populations based on these figures of arrivals.
####Plot of pops based on 2016 census####
gg4 = ggplot(e[e$origin=="Somalia",],
aes(x=end, y=value, group=origin))+
geom_line(aes(linetype=origin), lwd=0.75)+
geom_line(data=e[e$origin=="Kenya",],
aes(linetype=origin), lwd=0.75)+
geom_line(data=e[e$origin=="Ethiopia",],
aes(linetype=origin), lwd=0.75)+
geom_line(data=e[e$origin=="Eritrea",],
aes(linetype=origin), lwd=0.75)+
theme_bw()+
scale_linetype_manual("Origin",values=c(1,2,3,4))+
labs(y="Arrivals", x="Year")+
scale_y_continuous(breaks=c(2500,5000,7500,10000,12500),
labels=c("2,500", "5,000","7,500","10,000", "12,500"))+
scale_x_continuous(breaks=seq(from=1991, to=2016, by=5))
gg4
gg4 = ggplot(e[e$origin=="Somalia",],
aes(x=end, y=cumsum(value), group=origin))+
geom_line(aes(linetype=origin), lwd=0.75)+
geom_line(data=e[e$origin=="Kenya",],
aes(linetype=origin), lwd=0.75)+
geom_line(data=e[e$origin=="Ethiopia",],
aes(linetype=origin), lwd=0.75)+
geom_line(data=e[e$origin=="Eritrea",],
aes(linetype=origin), lwd=0.75)+
theme_bw()+
scale_linetype_manual("Origin",values=c(1,2,3,4))+
labs(y="Cumulative pop. (thousands)", x="Year")+
scale_y_continuous(breaks=seq(from=5000, to=40000, by=5000),
labels=c("5", "10", "15", "20", "25", "30", "35", "40"))+
scale_x_continuous(breaks=seq(from=1991, to=2016, by=5))
gg4
Part 3: Maps
R is capable of creating excellent maps, but it can take a little practice to get the hang of projections, plotting different spatial data sets over each other, and other functions. I’ll just walk through a basic example of code used for maps in this article and point you to some resources that I have found helpful.
We’re going to need a few additional packages to project our spatial data. rgdal will allow us to read in shapefiles and manipulate them. You can either set a working directory where shapefiles are located or just read them in straight from wherever they are (which I do here, since I keep them in a different file from my statistical data).
library(rgdal)
candiv<-readOGR(dsn="G://R/US_Census/canada_divisions/canada_divisions.geojson", layer="OGRGeoJSON")
canprov<-readOGR(dsn="G://R/US_Census/canada_provinces/canada_provinces.geojson", layer="OGRGeoJSON")
usstates-readOGR(dsn="G://R/US_Census/GeoJSON/states.geojson",layer="OGRGeoJSON")
</code
Now, the shapefiles/GeoJSON come in a sort of raw geographical format that is not projected, and also not very useful for plotting statistics in ggplot, which is what I want to use them for. So we’ll create a projection for the shapefiles and make them into spatial data by applying this coordinate reference system (CRS) to the data. In this case, I want to use an Albers equal-area projection. You can play around with the parameters if you like, but I wrote this one specifically to get what I see as a decent projection that centers Canada when it is drawn and doesn’t distort the northern parts of the country. The steps to get from shapefile to ggplot are several, as I understand it (I’m professedly not an expert in this area, but this code will get the job done). The fortify() command will get the data into a data frame format that ggplot can read. I set the coordinates to zoom in a little. Plot the map to make sure it’s working properly.
library(rgeos)
library(ggplot2)
# Albers equal-area projection parameters
aea.proj <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96
+x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m"
candivmap<-(spTransform(candiv, CRS(aea.proj)))
canprovmap<-(spTransform(canprov, CRS(aea.proj)))
usstatesmap<-spTransform(usstates,CRS(aea.proj))
canprovmap<-SpatialPolygonsDataFrame(gBuffer(canprovmap, byid=TRUE,
width=0),
data=canprovmap@data)
candivmap<-SpatialPolygonsDataFrame(gBuffer(candivmap, byid=TRUE,
width=0),
data=candivmap@data)
usstatesmap<-SpatialPolygonsDataFrame(gBuffer(usstatesmap, byid=TRUE,
width=0),
data=usstatesmap@data)
canprovmap_map<-fortify(canprovmap, region="PRUID")
candivmap_map<-fortify(candivmap, region="CDUID")
usstatesmap_map<-fortify(usstatesmap, region="GEOID")
gg<-ggplot()+geom_map(data=candivmap_map, map=candivmap_map,
aes(x=long, y=lat, map_id=id),
color="darkgray", fill="white",lwd=0.1)
gg<-gg+geom_map(data=usstatesmap_map, map=usstatesmap_map,
aes(x=long,y=lat, map_id=id),
color="black",fill=NA,lwd=0.2)
gg<-gg+geom_map(data=canprovmap_map, map=canprovmap_map,
aes(x=long,y=lat,map_id=id),
color="black",fill=NA,lwd=0.2)
gg<-gg+coord_fixed(xlim=c(-2700000,2900000),
ylim=c(600000,4000000))
gg
This is a good background map, but we’ll want CMA boundaries in order to plot census data. Here, gmc000b11a_e is a shapefile downloaded from Statcan's website. It's included in the dataverse files, but note its different format from the GeoJSON files used above.
cma<-readOGR("G://Map_Projects/Canada_census/gcma000b11a_e")
cmamap<-spTransform(cma, CRS(aea.proj))
cmamap<-SpatialPolygonsDataFrame(gBuffer(cmamap, byid=TRUE,
width=0),
data=cmamap@data)
cmamap_map<-fortify(cmamap, region="CMAPUID")
gg<-gg+geom_map(data=cmamap_map, map=cmamap_map,
aes(x=long, y=lat, map_id=id),
color="darkgray", fill="white",lwd=0.5)
gg
This looks good to go. Now we just need some data. Read in the 2006 and 2016 data and create some identifiers. So let’s read in the data. Then we need to match up the 2006 and 2016 data in a format that can merge with the CMAs shapefile. The 2006 data have the geographic ID of each CMA, while the 2016 data include only the last digits of this number (the first two are the province number). So we’ll create a new column in the 2006 data called “Geoid” that removes the first two digits (and replaces some of the numbers that would otherwise become single-digit). Then we will create a “Nameshort” column that takes the first word of each geographic place name. Between these two, we should be able to merge the 2006 and 2016 data. Then we will use the “id” column in the 2006 data to merge the resulting dataset with the CMA shapefile.
a<-read.csv("G://R/US_Census/Canada_data/Canadamaps/Canada_reviseddata2006.csv", stringsAsFactors = FALSE)
a = a[,-c(1)]
b = read.csv("Canada_2016_CDs.csv", stringsAsFactors = FALSE)
b = b[,-c(1)]
a$Geoid = gsub("^[0-9]{2}", "", a$id)
a$Geoid = gsub("^00", "10", a$Geoid)
a$Geoid = gsub("^01[0-9]{1}", "111", a$Geoid)
a$Nameshort = gsub("\W.", "",a$Geography)
b$Nameshort = gsub("\W.", "", b$Geography)
z = merge(a, b, by=c("Nameshort", "Geoid"))
amap<-merge(cmamap_map, z, by="id")
Now for some parameters to set up the maps. First, breaks to use as categories for population maps – I like the palettes ggplot2 offers in scale_fill_brewer(), so I want to set this up for these color ranges, meaning that 7-9 classes is ideal. Then I set up a style guideline called “ditch_axes”—I did not come up with all of this myself, and thanks to whoever did and whose blog or website link I have misplaced. Next, I want to create some labels. Since we will color the populations based on the breaks we created, we first “cut” the category we want to map, which will place all of the column values into one of the classes. We then want labels for each class. The code creating the list “labels” should do it, and modifications of this code segment can be used for a variety of purposes.
breaks = c(0.9,10,50,100,500,1000,5000,8000,Inf)
ditch_axes<-theme(axis.text = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.background=element_rect(fill="white"),
legend.key.size=unit(0.35, "cm"))
amap2 = amap
amap2$Somali_total_2016 = cut(amap2$Somali_total_2016, breaks=breaks)
amap2<-amap2[order(amap2$Somali_total_2016),]
labels = as.character(unique(amap2$Somali_total_2016))
labels = gsub("\(", "", labels)
labels = gsub("\]", "", labels)
labels = gsub(",", "-", labels)
labels=gsub("e\+03", ",000", labels)
labels=gsub("e\+04", "0,000", labels)
labels=gsub("0.9", "1", labels)
labels=gsub("-Inf", "+", labels)
labels<-c(labels[1:6],"5,000-8,000",labels[7:8])
Finally, we’re ready to draw the map. For the geospatial data to draw correctly, we need to make sure it’s in the right order—which is listed for shapefiles in a column called “order”. Then we just plug into the map “gg” that we already created. You can play around with layering different things on top or underneath, and there are also parameters for opacity (“alpha”), and much more. The map of Canada is difficult to read because Canada is so big and much of it is remote islands in the northern part. So after setting up the map, we can create maps of zoomed-in areas using the “coord_fixed” command.
amap2 = amap2[order(amap2$order),]
gg3<-gg+geom_map(data=amap2, map=amap2,
aes(x=long, y=lat, map_id=id),
color="black", fill=NA,lwd=0.1)
gg3<-gg3+geom_polygon(data=as.data.frame(amap2),
aes(x=long, y=lat, group=group,
fill=cut(Ethiopian_Total, breaks=breaks)))+
scale_fill_brewer(drop=FALSE, palette="YlGn",
name = "Total Ethiopian\npopulation, 2006",
guide=guide_legend(reverse=TRUE),
na.value="white", labels=labels)
gg3
gg3<-gg3+geom_map(data=usstatesmap_map, map=usstatesmap_map,
aes(x=long,y=lat, map_id=id),
color="black",fill=NA,lwd=0.2)
gg3<-gg3+geom_map(data=canprovmap_map, map=canprovmap_map,
aes(x=long,y=lat,map_id=id),
color="black",fill=NA,lwd=0.2)
gg3 = gg3+ditch_axes
ggzoom<-gg3+coord_fixed(xlim=c(600000,1800000),
ylim=c(600000,1200000))
ggzoom
ggzoom2<-gg3+coord_fixed(xlim=c(-2200000,2600000),
ylim=c(600000,2100000))
ggzoom2
Now you can go back and re-do this for other years and measures, or write a loop and print out a bunch of maps at once with different measures in the data. For Figures 4-6 in the article, this was the procedure used to create multiple plots—and then a neat little snippet of code to put them in the same jpeg, courtesy of Cookbook for R “Multiple graphs on one page (ggplot2)”. When you print the plots for wide maps such as those I created, you will specify one column in the jpeg output.
multiplot = function(..., plotlist=NULL, file, cols=1, layout=NULL){
library(grid)
plots<-c(list(...), plotlist)
numPlots = length(plots)
if(is.null(layout)){
layout<-matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow=ceiling(numPlots/cols))
}
if(numPlots==1){print(plots[[1]])
}else{
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
for(i in 1:numPlots){
matchidx <-as.data.frame(which(layout==i, arr.ind=TRUE))
print(plots[[i]], vp=viewport(layout.pos.row=matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
multiplot(ggzoom2, ggzoom4, cols=1)
jpeg("Ethiopians_2006_2016.jpg", width=8, height=4, units="in", res=500)
print(multiplot(ggzoom2, ggzoom4, cols=1))
dev.off()
Same procedure, but just changing the variables and some of the output parameters will get you three maps on the same page. Note also that the breaks and labels used in the maps for Amhara and Oromo diaspora below are different, since the populations as reported in census data are much lower than the populations reporting themselves simply as Ethiopian.
Cool. You can modify parameters and variables to generate a variety of maps, but note that different projections may be more relevant if zooming in on particular areas, etc. Example map below is proportion of Somali ethnic to Somalian national origins.
Part 3: Generating summary tables
I want summary tables of stats for each year – specifically, the proportion of ethnic groups to places of origin (e.g., Somalis to Somalia, Ethiopians to Ethiopia), and the proportion of people within ethnic groups reporting multiple ancestry as opposed to single ancestry.
Step 1 is to calculate these proportions. For the 1996 data, we have estimates for Somalis, Ethiopians, and Eritreans. For later years, additional relevant ethnic groups can be found in the data, but in every case a similar procedure to this can be followed:
a<-read.csv("Canada1996_CMAs.csv", stringsAsFactors = FALSE)
a = a[,-c(1)] #Delete first column
a = a[,c(1,15:17,2:14)] #Reorder columns to get ID variables up front
#Now, one of the main variables I'm interested in for this study is
#the proportion of people reporting ethnic descent to birth in a particular country.
#So, I'll create new columns of this basic proportion.
a$Somali_prop_1996 = a$Somali_total_1996/a$Somalia_bn_1996
a$Ethiopian_prop_1996 = a$Ethiopian_total_1996/a$Ethiopiabn_recimm_1996
a$Somali_propmult_1996 = a$Somali_mult_1996/a$Somali_total_1996
a$Ethiopian_propmult_1996 = a$Ethiopian_mult_1996/a$Ethiopian_total_1996
a$Eritrean_propmult_1996 = a$Eritrean_mult_1996/a$Eritrean_total_1996
#Replace all infinite values in data frame before calculating summary statistics
Since in some cases there might be infinite values returned from this operation (e.g., all ethnic Somalis in one area are not from Somalia, or data was entered wrong), we’ll eliminate these values. This is a useful chunk of code to know in general when calculating proportions or percentages that may return infinite values. I then subset the data, dividing into “primary” divisions and a separate set excluding the sum values that were created for all of Canada.
a<-do.call(data.frame, lapply(a, function(x)
replace(x, is.infinite(x),NA)))
b<-a[a$Geotype=="primary",]
c<-a[!a$Geography=="Total",]
The nested commands as.data.frame(unlist(summary( ))) will give us a data frame with summary statistics. Then we’ll specify the names of these stats and reshape the data frame so that each row contains the summary stats for one variable in one census year. This shows you how to organize the data frame step by step to get one row that includes a variable and its summary statistics, with the appropriate labels.
d = data.frame(unclass(summary(b$Somali_prop_1996)))
z = d
#First go-round to set up data frame and check...
z$stat = c("Min", "Q1", "Med", "Mean", "Q3", "Max", "NAs")
z$variable = gsub(".\.b\.", "", colnames(z)[1])
z = reshape(z, direction="wide", timevar="stat", idvar="variable")
z = z[,c(1,4,5,7,8)]
colnames(z) = gsub(".\.\.\.", "", colnames(z))
z$Geography = "CMAs"
z$Obs = dim(b)[1]-z$NAs
This is a lot of work to go through each variable, but you can just write a loop to process the remainders and append them all as rows in the data frame. This operation involves: (a) selecting the columns you want summary stats for (identify in the format “i in c(column numbers)”); (b) subsetting geographies so we don’t duplicate (here I do “Total”, which is all of Canada, and “primary”, which eliminates the duplicate sub-geographies according to the way I set up the 1996 data; this will have to be adjusted for other datasets); (c) creating a new dataset for each variable; and (d) using “rbind()” command to append them together in the dataset z that we created above.
for(i in c(18:22)){
b = a[a$Geotype=="primary",]
c = a[a$Geography=="Total",]
name = colnames(b)[i]
name2 = colnames(c)[i]
colnames(b)[i] = "select"
colnames(c)[i] = "select"
d = data.frame(unclass(summary(b$select)))
e = data.frame(unclass(summary(c$select)))
d$stat = c("Min", "Q1", "Med", "Mean", "Q3", "Max", "NAs")
d$variable = name
d = reshape(d, direction="wide", timevar="stat", idvar="variable")
d = d[,c(1,4,5,7,8)]
colnames(d) = c("variable", "Med", "Mean", "Max", "NAs")
d$Geography="CMAs"
d$Obs = dim(b)[1] - d$NAs
z = rbind(z, d)
e$stat = c("Min", "Q1", "Med", "Mean", "Q3", "Max")
e$variable=name2
e = reshape(e, direction="wide", timevar="stat", idvar="variable")
e = e[,c(1,4,5,7)]
colnames(e) = c("variable", "Med", "Mean", "Max")
e$NAs = NA
e$Geography="Canada"
e$Obs = 1
z = rbind(z, e)
}
Again, since this is a teaching script, I’ll let you figure out the code for 2001-2016 data. It’s very similar, but you also need to be attuned to any minor differences between dataset formats from different years—mainly column numbers relevant for calculations. Once you calculate the relevant proportions for each year, you can simply select the relevant columns, run a loop (possibly making some minor code adjustments), and append the rows to the existing z data frame.
After this, I do a little organization to get data organized in a format to plot in ggplot2. This means we’ll need a column for origins, a column specifying the year, and one for the type of proportion.
I put these identifiers in the first few columns and leave the data in the rest, for easier viewing. Now I can export the whole thing as a CSV.
z$Year = gsub(".\_1", "1", z$variable)
z$Year = gsub(".\2", "2", z$Year)
z$Year = gsub("[^0-9]", "", z$Year)
z$Year = as.numeric(z$Year)
z$variable = gsub("\[0-9].", "", z$variable)
z$variable = gsub("Prop\_", "", z$variable )
z$variable = gsub("n$", "n_prop", z$variable)
z$variable = gsub("Somali$", "Somali_prop", z$variable)
z$variable = gsub("propmult", "mult", z$variable)
z$variable = gsub("Tigrean", "Tigray", z$variable)
unique(z$variable)
z$stat = gsub(".\", "", z$variable)
z$variable = gsub("\.*", "", z$variable)
z = z[,c(1,6,8,9,7,2:5)]
write.csv(z, "Canada_PROPS_1996_2016.csv")
Once this is complete, we can start plotting. There are two main variables we want to consider: ethnicity to birth location, and populations reporting multiple ethnicity, where we might see some overlaps between ethnic categories that appear to be distinct or divergences between some that would seem to be essentially synonymous.
Part 4: Analyzing proportions of Ethnicity to birth location
Plotting
Given sub-ethnic/sub-national group differences and differing histories of migration, it is likely that there is some skew in the data. For example, ethnic Somalis from Ethiopia or Kenya might be clustered in specific locations, meaning that certain areas would have much higher proportions of ancestry to birth. So it’s relevant to look at the consistencies and differences across multiple indicators.
First, for Canada as a whole, where there is only one proportion of ancestry to birth for each population (that is, there is no mean, median). Here is the example code for Canada as a whole, including code to export the ggplot graph as a .jpg image.
a = z[z$Geography=="Canada",]
b = z[z$Geography=="Provinces",]
c = z[z$Geography=="CMAs",]
library(ggplot2)
####Ancestry to birth location####
gg = ggplot(a[a$stat=="prop",], aes(x=Year, y=Med, group=variable))+
geom_line(aes(linetype=variable), lwd=0.75)+
scale_linetype_manual("Origin",values = c(1,2,3,4))+
theme_bw()+
labs(y="Prop. ancestry to\nbirth location, Canada total")
gg
jpeg("Canada_Prop.jpg", height=4, width=6, units="in", res=400)
print(gg)
dev.off()
gg2 = ggplot(b[b$stat=="prop",], aes(x=Year, y=Med, group=variable))+
geom_line(aes(linetype=variable), lwd=0.75)+
scale_linetype_manual("Origin",values = c(1,2,3,4))+
theme_bw()+
labs(y="Median prop. ancestry\nto birth location, provinces")
gg2
jpeg("Provinces_Prop.jpg", height=4, width=6, units="in", res=400)
print(gg2)
dev.off()
Follow this procedure but adjust the ggplot code to look at medians and means for other geographical units such as provinces and CDs/CMAs.
T-test differences in means between groups
Now I want to look at summary statistics for proportions of ethnicity to birth location, focusing on mean and median—and also consider t-tests between them to measure whether there is a significant difference in mean proportions between groups. We can take a first glance at these summary stats by printing the spreadsheet that we created using the code loops above. In the publication, Table 2 displays the proportion of ethnicity to birth location for groups from the Horn of Africa.
For the t-tests, we will want not the summary data, but the original data since we need the whole sample so that the calculation can account for variance, etc. So we go back to the original data and set up the variables we want:
setwd("I://R/US_Census/Canada_data")
a = read.csv("Canada1996_CMAs.csv")
a = a[,-c(1)]
a$Somali_prop_1996 = a$Somali_total_1996/a$Somalia_bn_1996
a$Ethiopian_prop_1996 = a$Ethiopian_total_1996/a$Ethiopiabn_recimm_1996
a$Somali_propmult_1996 = a$Somali_mult_1996/a$Somali_total_1996
a$Ethiopian_propmult_1996 = a$Ethiopian_mult_1996/a$Ethiopian_total_1996
a$Eritrean_propmult_1996 = a$Eritrean_mult_1996/a$Eritrean_total_1996
a<-do.call(data.frame, lapply(a, function(x)
replace(x, is.infinite(x),NA)))
Then we select the geography we want and use the nested commands as.data.frame(unlist(t.test( ))) to create a data frame with the t-test results. This is the same structure that we used to do summary statistics. The resulting data frame is in long format, and we want it in wide format so that we can put multiple measures and multiple years together as separate cases (rows). So we’ll reshape to wide. For the first go-round, rename the data set; for subsequent go-rounds, change the variables or the geography type and then use rbind( ) to append new rows to the first row of data you created.
a2 = a[a$Geotype=="primary",]
x = as.data.frame(unlist(t.test(a2$Somali_prop_1996,
a2$Ethiopian_prop_1996,
paired=FALSE)))
x$stat = c("T-stat", "Parameter", "p-value", "CI-low", "CI-high",
"Mean-X", "Mean-Y", "Null", "Alt", "Method", "Data")
x$variable = "Somaliprop_Ethprop"
x = reshape(x, direction="wide", timevar="stat", idvar="variable")
colnames(x) = gsub(".*\)\)\.", "", colnames(x))
x$Geography = "CMAs"
#For the first go-round, rename x as "z"
z = x
#For subsequent go-rounds if you want t-tests for other variables,
#bind them together
z = rbind(z, x)
Now since I only have a few datasets for which I need to complete this operation, I’m just going to copy/paste code and change the year, then select the t-test types that I want. It would be fairly straightforward to write a code loop for this operation, but since I’m interested only in a couple columns from each of five datasets, I’ll just do it manually since it would take about just as much time to identify the column numbers I need and make sure the loop runs smoothly (again, I’m no computer scientist and I’m sure that much of this code is clunky).
a = read.csv("Canada_2001_CMAs.csv", stringsAsFactors = FALSE)
a = a[,-c(1)]
a$Somali_prop_2001 = a$Somali_total_2001/a$Somalia_bn_total
a$Ethiopian_prop_2001 = a$Ethiopian_total_2001/a$Ethiopia_bn_total
a$Eritrean_prop_2001 = a$Eritrean_total_2001/a$Eritrea_bn_total
a$Kenyan_prop_2001 = a$Kenyan_total_2001/a$Kenya_bn_total
a = do.call(data.frame, lapply(a, function(x)
replace(x, is.infinite(x), NA)))
b = a[a$Geoid>100,]
c = a[a$Geoid>1 & a$Geoid<100,]
{x = as.data.frame(unlist(t.test(b$Kenyan_prop_2001,
b$Somali_prop_2001,
paired=FALSE)))
x$stat = c("T-stat", "Parameter", "p-value", "CI-low", "CI-high",
"Mean-X", "Mean-Y", "Null", "Alt", "Method", "Data")
x$variable = "Kenprop_Somprop"
x = reshape(x, direction="wide", timevar="stat", idvar="variable")
colnames(x) = gsub(".*\\)\\)\\.", "", colnames(x))
x$Geography = "CMAs"
z = rbind(z, x)
}
The results of these t-tests are examined in the part of the text accompanying tables 2 and 3.
4B: Ethnicity to birth location, first generation only
For later years, the data includes estimates of ethnicity and birth for each of the first three generations of each group. For the above procedure I used the total estimates of ethnicity, since this is comparable to the US ACS data that I am concerned to use for comparison. But once I’ve established that the proportion of ethnic Somalis to people born in Somalia is similar between the Canada and the US data, I can use Beyond 20/20 to export the same tables including only estimates of first-generation immigrants from the Horn of Africa. These generation estimates are included in the dataset I’ve called “replicationdata”. So, using that data, here’s the code I used to generate the plot in Figure 7 in the article:
z = read.csv("Canada_2016_Replicationdata.csv", stringsAsFactors = FALSE)
z$Geoid = as.numeric(z$Geoid)
z1 = z[z$Geoid==1,]
summary(z1$Propest_Somali_2016)
summary(z1$Propest_Ethiopia_2016)
z2 = z[z$Geoid>100,]
summary(z2$Propest_Somali_2016)
summary(z2$Propest_Ethiopia_2016)
t.test(z2$Propest_Somali_2016, z2$Propest_Ethiopia_2016,
paired=FALSE)
plot(z2$Propest_Somali_2016, z2$Propest_Ethiopia_2016)
z2 = z2[z2$Propest_Somali_2016>0 &
z2$Propest_Ethiopia_2016>0,]
z2 = z2[!is.na(z2$Propest_Somali_2016),]
z2 = z2[!is.na(z2$Propest_Ethiopia_2016),]
summary(lm(z2$Propest_Somali_2016~z2$Propest_Ethiopia_2016))
library(ggplot2)
gg = ggplot(z2, aes(x=Propest_Ethiopia_2016, y=Propest_Somali_2016))+
geom_point(size=3, pch = 21, fill="grey30")+
geom_smooth(method="lm")+
theme_bw()+
labs(y = "Proportion first-gen. Somalis\n to population born in Somalia",
x = "Proportion first-gen. Ethiopians\n to population born in Ethiopia")
gg
Part 5: Single and multiple ethnicity populations
Since my analysis of the growth of multiple-ethnicity populations in the US ACS data produced interesting results regarding the apparent dramatic growth of multiple-ancestry populations over the past decade, I wanted to compare and see if this was the case with Canada data. First, then, to look at multiple ancestry among HoA populations on the whole, and then to look at first-generation trends, which I could not do with the US data at the time I conducted that analysis.
If you followed my procedure in Part 3 above successfully, you should have a data set containing average proportions of (a) ethnic to national origins, which we used in the preceding section; and (b) multiple to single ethnicity, which we will use in the first part of this section. After clearing out the R console, we’ll reload that data including proportions for these measures, then select the variables we want, and put it in ggplot. Here is the plot of the total proportions for Canada as a whole, Figure 8 in the article.
setwd("I://R/US_Census/Canada_data")
z = read.csv("Canada_PROPS_1996_2016.csv")
z = z[z$variable=="Ethiopian"|z$variable=="Eritrean"|
z$variable=="Somali",]
a = z[z$Geography=="Canada",]
b = z[z$Geography=="Provinces",]
c = z[z$Geography=="CMAs",]
gg = ggplot(a[a$stat=="mult",],
aes(x=Year, y=Med, group=variable))+
geom_line(aes(linetype=variable), lwd=0.75)+
scale_linetype_manual("Ethnicity", values=c(1,2,3))+
theme_bw()+
scale_x_continuous(breaks = c(1996, 2001,2006,2011,2016))+
labs(y="Prop. multiple ethnicity\nto total ethnicity, Canada total")
gg
I also looked at different measures and different geographies, though the charts were not included in the article. Though the data offers more helpful variables than the US ACS data, the skew of data and the small number of cases makes it tough. Really, most people tracing ancestry or birth to the Horn of Africa are concentrated in only a few locations in Canada, so in order to get a decent sample size of geographic units, we’d include some with extremely small populations. Anyways, maybe using future census data or another approach, someone can critique my findings here. Here’s a sample to create the above plot using a geography of CMAs rather than Canada as a whole, and the measure as the median proportion of multiple ethnicity to total ethnicity.
gg2 = ggplot(b[b$stat=="mult",],
aes(x=Year, y=Med, group=variable))+
geom_line(aes(linetype=variable), lwd=0.75)+
scale_linetype_manual("Ethnicity",values=c(1,2,3))+
theme_bw()+
scale_x_continuous(breaks=c(1996,2001,2006,2011,2016))+
labs(y="Median prop. multiple ethnicity\nto total ethnicity, provinces")
gg2
The code for the Amhara, Ethiopian and Oromo chart (Fig. 10) is exactly the same as above, using Canada as a whole and subsetting the data to include those ethnicities. Likewise, the code for Figure 9 is the same, but I went back and exported the single-and multiple ancestry tables for only first-generation Somalis, Ethiopians, etc. from Beyod 20/20 in order to create it.
The remainder of the census data analysis is basic, just looking at populations in Toronto and Edmonton.