Just Another Data Science Blog...

How Can One Not Like ggplot2?

In the previous post we created our credentials on Kaggle, created an API key, and retrieved some data from the website via the command line. In this post, we'll be using R to parse the downloaded data and work through building a figure with ggplot2. It's a very nice graphics library and I really enjoy using it when plotting data or creating visuals. In my opinion, ggplot2 is, hands down, better than matplotlib. While matplotlib is powerful, ggplot2 is easier to use and its output just looks better, though you can render matplotlib figures with ggplot2-like styling.

To get started, we call the library function to load the packages we want to use: dplyr and ggplot2. We'll also read in the CSV data from Kaggle that we downloaded in the last post. We can preview the first few rows using the head function.

library(dplyr)
library(ggplot2)
df <- read.csv('../data/Crops_AllData_Normalized.csv')
head(df,5)
A data.frame: 5 × 11
Area.Code  Area         Item.Code  Item                 Element.Code  Element         Year.Code  Year   Unit   Value  Flag
<int>      <fct>        <int>      <fct>                <int>         <fct>           <int>      <int>  <fct>  <dbl>  <fct>
2          Afghanistan  221        Almonds, with shell  5312          Area harvested  1975       1975   ha     0      F
2          Afghanistan  221        Almonds, with shell  5312          Area harvested  1976       1976   ha     5900   F
2          Afghanistan  221        Almonds, with shell  5312          Area harvested  1977       1977   ha     6000   F
2          Afghanistan  221        Almonds, with shell  5312          Area harvested  1978       1978   ha     6000   F
2          Afghanistan  221        Almonds, with shell  5312          Area harvested  1979       1979   ha     6000   F
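
One thing to watch for: the <fct> column types above come from read.csv converting strings to factors. Since R 4.0.0 the default is stringsAsFactors = FALSE, so on a current R install those columns would be read as character instead. Here is a minimal sketch of the difference, using a small made-up CSV (not the Kaggle data) passed in through the text argument:

```r
# Toy CSV standing in for the Kaggle file (values are made up for illustration)
csv_text <- "Area,Element,Year,Value
China,Production,2018,10
China,Area harvested,2018,5"

# Default behaviour: character columns on R >= 4.0.0, factors on older versions
as_default <- read.csv(text = csv_text)

# Explicitly asking for factors, which matches the <fct> columns shown above
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_default$Element)
class(as_fct$Element)
levels(as_fct$Element)
```

If you are following along on R 4.0.0 or later, passing stringsAsFactors = TRUE to the read.csv call should reproduce the factor columns that the levels step later in this post relies on.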

Next we're going to use some functions from the dplyr package to manipulate our dataframe so we can plot data. For this post I'm going to grab information about maize and soybeans in the United States and China. We'll look at area harvested versus production for the two countries and crops. To do this, we'll call filter to keep only the rows we want, then use select to pick the columns we need for plotting. After saving the result to a new variable, crops, we can check our work by looking at the last few rows with tail.

# Keep the two countries, two crops and two measures, then the plotting columns
crops <- df %>%
    filter(
        Area %in% c("United States of America", "China"),
        Item %in% c("Soybeans", "Maize"),
        Element %in% c("Area harvested", "Production")
    ) %>%
    select(Area, Item, Year, Element, Value, Flag, Unit)
tail(crops, 5)
A data.frame: 5 × 7
     Area                      Item      Year   Element     Value      Flag   Unit
     <fct>                     <fct>     <int>  <fct>       <dbl>      <fct>  <fct>
468  United States of America  Soybeans  2015   Production  106953940         tonnes
469  United States of America  Soybeans  2016   Production  116931500         tonnes
470  United States of America  Soybeans  2017   Production  120064970         tonnes
471  United States of America  Soybeans  2018   Production  120514490         tonnes
472  United States of America  Soybeans  2019   Production  96793180          tonnes
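
To make the filter and select steps easier to follow, here is a small sketch on a made-up data frame (the rows and values are invented, only the column names mirror the real data). A row survives only if it passes all three %in% conditions, and select then trims the columns:

```r
library(dplyr)

# Made-up rows for illustration; not taken from the Kaggle file
toy <- data.frame(
    Area    = c("China", "China", "Brazil", "United States of America"),
    Item    = c("Maize", "Wheat", "Maize", "Soybeans"),
    Element = c("Production", "Production", "Production", "Yield"),
    Value   = c(10, 20, 30, 40)
)

kept <- toy %>%
    filter(
        Area %in% c("United States of America", "China"),  # drops Brazil
        Item %in% c("Soybeans", "Maize"),                  # drops Wheat
        Element %in% c("Area harvested", "Production")     # drops Yield
    ) %>%
    select(Area, Item, Value)

kept
```

Only the China/Maize/Production row passes every condition, which is the same logic that narrows the full dataset down to crops.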

Now we'll create our first graph. It won't make sense and will look ugly at first, but we'll add features in successive steps. We call ggplot and pass crops as the data argument. The next step is to apply aesthetic mappings with aes. We'll plot Production and Area harvested by year using the Value column. Production is measured in tonnes, while Area harvested is measured in hectares (ha). Since we're creating a line graph, we'll color the lines by Element. On its own, the ggplot call renders an empty figure with axes but no data. To actually draw the data we have to add a geom, or geometric object. In our case, we'll add geom_line to plot the data as a line graph.

p1 <- (
    ggplot(crops, aes(Year, Value, colour=Element)) + 
    geom_line()
    )
p1
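
As a side note on the "figure without data" point, here is a minimal sketch on made-up values showing that a plot built from only data and aes has no layers until a geom is added. It assumes the plot object exposes its layers as a list through $layers, which is how I've seen ggplot2 objects behave:

```r
library(ggplot2)

# Made-up data purely for illustration
toy <- data.frame(
    Year  = c(2000, 2001, 2002),
    Value = c(1, 3, 2)
)

# Data and aesthetic mappings only: axes are drawn, but there is nothing to plot
empty <- ggplot(toy, aes(Year, Value))

# Adding a geom gives ggplot2 a layer to draw
with_line <- empty + geom_line()

length(empty$layers)      # number of layers before adding a geom
length(with_line$layers)  # number of layers after adding geom_line()
```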

As you can see, this is a pretty ugly graph. Because the lines are grouped only by Element, ggplot2 connects all four country/crop series into a single line per color, so each year has several values and the line jumps vertically between them. No worries, we'll get it straightened out. First, I want to add points to the graph to better visualize the data. To do this we'll add geom_point.

p1 <- p1 + geom_point()
p1
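
The tangled lines come down to grouping. The sketch below, on made-up data, counts how many separate lines ggplot2 would draw when only colour is mapped versus when a group aesthetic is added. The group mapping is an alternative I'm showing for illustration, not what this post uses (we'll rely on panels instead), and n_groups is a hypothetical helper defined here:

```r
library(ggplot2)

# Made-up data: two countries, two measures, three years each
toy <- data.frame(
    Area    = rep(c("China", "United States of America"), each = 6),
    Element = rep(rep(c("Area harvested", "Production"), each = 3), times = 2),
    Year    = rep(2000:2002, times = 4),
    Value   = c(1, 2, 3, 10, 11, 12, 4, 5, 6, 20, 21, 22)
)

# Grouped by colour only: both countries end up on the same line
by_colour <- ggplot(toy, aes(Year, Value, colour = Element)) +
    geom_line()

# Explicit group: one line per country/measure combination
by_series <- ggplot(toy, aes(Year, Value, colour = Element,
                             group = interaction(Area, Element))) +
    geom_line()

# Hypothetical helper: count distinct groups in the first layer's computed data
n_groups <- function(p) length(unique(layer_data(p, 1)$group))

n_groups(by_colour)
n_groups(by_series)
```

With colour alone there is one line per Element, so values from different countries in the same year get joined. Giving each series its own group, or its own panel, separates them.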

Here the data is a little easier to pick out and we can start to see what the underlying series might look like. However, you still can't distinguish between the United States and China, or between maize and soybeans. To fix this, we apply facet_wrap to break the data out into multiple panels, one for each of the four country/crop combinations. We'll also "free" the y-axis with scales="free_y" so each panel gets a y-axis range suited to its own data.

p1 <- p1 + facet_wrap(Area~Item, scales="free_y")
p1
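
For a self-contained look at how the Area~Item formula maps to panels, here is a rough sketch on made-up data. It builds one row per country, crop and year, facets the same way, and counts the panels ggplot2 assigns:

```r
library(ggplot2)

# Made-up grid: 2 countries x 2 crops x 3 years
toy <- expand.grid(
    Area = c("China", "United States of America"),
    Item = c("Maize", "Soybeans"),
    Year = 2000:2002,
    stringsAsFactors = FALSE
)
toy$Value <- seq_len(nrow(toy))  # placeholder values

p <- ggplot(toy, aes(Year, Value)) +
    geom_line() +
    facet_wrap(Area ~ Item, scales = "free_y")  # one panel per Area/Item pair

# Each row of the computed layer data records which panel it belongs to
panels <- unique(layer_data(p, 1)$PANEL)
length(panels)
```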

While not complete, this is a much better visual. The data looks coherent and we can distinguish between crops and countries. The next step is to refine the visual. I'm going to apply some changes using the theme function: setting the aspect ratio of the panels and moving the legend inside the plot. The marker dots and lines will be made smaller as well. The code below puts it all together and regraphs the data into a figure that we can almost call complete.

p1 <- (
    ggplot(crops, aes(Year, Value, colour=Element)) + 
    geom_line(alpha=.5, size=.2) +
    geom_point(size= .7) + 
    facet_wrap(Area~Item, scales="free_y") +
    theme(
        aspect.ratio = 1,
        legend.position = c(0.15, 0.9) 
    )
)
p1
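
The code above reflects the ggplot2 version this post was written with. On newer releases it should still run but, as far as I know, it emits deprecation warnings: ggplot2 3.4.0 introduced linewidth for line thickness in place of size, and ggplot2 3.5.0 moved numeric legend coordinates to legend.position.inside. Here is a sketch of the same styling under those assumptions, on made-up data:

```r
library(ggplot2)

# Made-up data purely for illustration
toy <- data.frame(
    Year    = rep(2000:2002, times = 2),
    Value   = c(1, 2, 3, 10, 11, 12),
    Element = rep(c("Area harvested", "Production"), each = 3)
)

styled <- ggplot(toy, aes(Year, Value, colour = Element)) +
    geom_line(alpha = 0.5, linewidth = 0.2) +  # linewidth assumes ggplot2 >= 3.4.0
    geom_point(size = 0.7) +                   # size still controls point size
    theme(
        aspect.ratio = 1,
        legend.position = "inside",            # assumes ggplot2 >= 3.5.0
        legend.position.inside = c(0.15, 0.9)  # x, y as fractions of the plot area
    )
styled
```

If you are on an older ggplot2, the original size and legend.position = c(...) spelling is the one to use.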

The graphs are really starting to come together. The last few items involve adding a title and reordering the levels of the Element column so the legend better reflects the data. Below we can see the levels in Element. Yield is still listed even though we filtered it out, because filter keeps unused factor levels. The line after the output reverses the order of the levels using rev.

levels(crops$Element)
  1. 'Area harvested'
  2. 'Production'
  3. 'Yield'
crops$Element <- factor(crops$Element, levels = rev(levels(crops$Element)))
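
Reversing the levels only changes their order, not the values stored in the column. A quick base R sketch on a made-up factor with the same three levels:

```r
# Made-up factor with the same three levels as crops$Element
element <- factor(
    c("Production", "Area harvested", "Production"),
    levels = c("Area harvested", "Production", "Yield")
)

# Rebuild the factor with the level order reversed
reversed <- factor(element, levels = rev(levels(element)))

levels(element)         # original order
levels(reversed)        # reversed order
as.character(reversed)  # the underlying values are unchanged
```

ggplot2 uses the level order for a discrete legend, which is why this one line is enough to flip the legend entries.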

Now we're ready to plot our final graph. Below is the code with the new title, run after the levels have been reversed. You can see that the legend now aligns better with the graph: Production, represented by red, is listed on top, which matches the arrangement of the majority of the data points.

p1 <- (
    ggplot(crops, aes(Year, Value, colour=Element)) + 
    geom_line(alpha=.5, size=.2) +
    geom_point(size= .7) + 
    facet_wrap(Area~Item, scales="free_y") +
    theme(
        aspect.ratio = 1,
        legend.position = c(0.15, 0.9),
        plot.title = element_text(hjust = 0.5)
    ) +
    labs(title = "Maize and Soybean Production and Area Harvested\nin the US and China According to FAO")
)
p1

We have arrived at our final output. Take a look at the top right panel displaying soybean production in China. I have to take these numbers with a grain of salt since they're from Kaggle and I haven't vetted the data, but it appears the soybean area harvested in China is trending down. If you juxtapose that with the rapid acceleration in maize production, it appears China has substituted corn for soybeans. It's interesting that corn acreage in the US has crept up slowly while soybean acreage expanded rapidly. I guess that's a testament to the limited, premium topsoil in the Midwestern US.

ggplot2 is my favorite plotting library. It's easy to use and creates beautiful, professional figures for publications and presentations. If you have the time, I highly recommend learning it.