
Fun with USDA NASS Data - Part 4: Let's Make a Visualization

This is the fourth and final post in a series working with the USDA NASS database. We can finally query the data we uploaded to AWS in the previous post here. In this post we'll query the data and make a visualization.

To start, we'll import the Python modules we need to query data from AWS using Athena. Matplotlib lets us visualize the data, and PyAthena lets us connect to AWS and query Athena. Since I have multiple AWS accounts, I have to pass the correct profile to the PyAthena connector; the profile name is stored in the AWS_LP environment variable and assigned to the pf variable.

import os

import pandas as pd
import matplotlib.pyplot as plt
from pyathena import connect
from pyathena.pandas_cursor import PandasCursor

# S3 location where Athena stages query results, and the AWS profile to use
ATHENA_RESULTS = os.environ['ATHENA_RESULTS']
pf = os.environ['AWS_LP']

# Build a cursor that returns Athena query results as pandas DataFrames
ATHENA_CURSOR = connect(s3_staging_dir=ATHENA_RESULTS,
                        region_name='us-east-1',
                        profile_name=pf,
                        cursor_class=PandasCursor).cursor()
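
If you want to confirm the connection works before pulling real data, a trivial query does the trick. This is just an optional sanity check I'm adding here, not part of the original workflow:

# Optional: a throwaway query to verify the connection and staging bucket
print(ATHENA_CURSOR.execute('SELECT 1 AS ok').as_pandas())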

After importing the required modules, we can query for the data. I pre-selected a few columns that we'll use to make our visualization, and I'm limiting the selection to the year 2017.

query = '''
    SELECT location_desc,
           value,
           domaincat_desc
    FROM crop_data.usda_crop_data
    WHERE year = 2017
'''
df = ATHENA_CURSOR.execute(query).as_pandas()

The location_desc column supplies the name of each state. The value column contains dollar values pertaining to each domaincat_desc category. We can view the first few rows of our dataframe using the .head() method.

df.head()
location_desc value domaincat_desc
0 RHODE ISLAND 7,005,000 ORGANIZATION: (INSTITUTIONAL & RESEARCH & RESE...
1 RHODE ISLAND 23,467,000 ORGANIZATION: (TAX PURPOSES, CORPORATION)
2 RHODE ISLAND 1,074,000 ORGANIZATION: (TAX PURPOSES, CORPORATION, (EXC...
3 RHODE ISLAND 1,074,000 ORGANIZATION: (TAX PURPOSES, CORPORATION, (EXC...
4 RHODE ISLAND 22,393,000 ORGANIZATION: (TAX PURPOSES, CORPORATION, FAMI...
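
One thing worth checking before we transform anything: as noted below, Athena returns the value column as a string rather than a number. A quick look at the dtypes (an optional check of my own, not from the original post) makes that visible:

# value shows up as object (string) dtype, commas included,
# so it needs cleaning before any arithmetic
print(df.dtypes)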

For our visualization we need to manipulate our dataframe into the correct format. I'd like to see what percent of a state's agriculture is irrigated, if it's irrigated at all. To do this, we select all the rows whose domaincat_desc contains the string 'IRRIGATION', and we also drop any location_desc containing a comma, which keeps only the state-level rows. We then pivot the table and convert the dollar amounts in the value column, which come back from Athena as strings, to actual numbers. Finally, we get the percentage of each irrigation type in each state using a couple of other methods, sum() and transpose().

t = (
    df.loc[
        # keep state-level rows (no comma) with an IRRIGATION category
        (~df['location_desc'].str.contains(',')) &
        (df['domaincat_desc'].str.contains('IRRIGATION'))
    ]
    # one row per irrigation category, one column per state
    .pivot(
        index='domaincat_desc',
        columns='location_desc',
        values='value'
    )
    .reset_index()
    .rename_axis(None, axis=1)
    # strip the thousands separators so the strings parse as numbers
    .replace(to_replace=r',', value='', regex=True)
    .set_index('domaincat_desc')
    .apply(pd.to_numeric)
)
# convert dollar totals to each category's share of the state total,
# then sort states by the share with no irrigation
tt = (
    (t / t.sum())
    .transpose()
    .sort_values("IRRIGATION STATUS: (NONE ON OPERATION)", ascending=False)
)
tt.head()
domaincat_desc IRRIGATION STATUS: (ANY ON OPERATION) IRRIGATION STATUS: (HARVESTED CROPLAND IS ENTIRELY IRRIGATED) IRRIGATION STATUS: (NONE ON OPERATION)
IOWA 0.035864 0.004206 0.959930
NORTH DAKOTA 0.083086 0.008937 0.907977
WEST VIRGINIA 0.067256 0.033463 0.899282
VERMONT 0.082267 0.019014 0.898720
MINNESOTA 0.103812 0.012910 0.883279
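
Since the three irrigation categories should account for all of a state's value, each row of tt ought to sum to one. A quick check, which I'm adding here as an optional sanity test, confirms the pivot and division behaved:

# each state's three irrigation shares should sum to (roughly) 1.0
print(tt.sum(axis=1).round(6).head())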

Now that we have our data in the correct format, we can make a visualization. Using matplotlib, we'll make a stacked bar chart of the percentages of irrigation types. To start off, we instantiate our figure using plt.figure(). The 16,6 you see below is sizing I added retroactively, since I didn't know how big the figure would need to be at the outset. Once we've built our figure, we add each data piece to it by plotting separate bar series; below, each one is denoted by p followed by a number. We then add some annotations with ylabel, xlabel, and so on, and matplotlib returns a nicely formatted chart once the cell runs.

plt.figure(figsize=(16, 6))
# bottom layer: share of operations with no irrigation
p1 = plt.bar(
    tt.index,
    tt['IRRIGATION STATUS: (NONE ON OPERATION)'],
    color='#7b74ea'
)
# middle layer: entirely irrigated cropland, stacked on the bottom layer
p2 = plt.bar(
    tt.index,
    tt['IRRIGATION STATUS: (HARVESTED CROPLAND IS ENTIRELY IRRIGATED)'],
    bottom=tt['IRRIGATION STATUS: (NONE ON OPERATION)'],
    color='#f58c48'
)
# top layer: any irrigation on operation, stacked on the first two
p3 = plt.bar(
    tt.index,
    tt['IRRIGATION STATUS: (ANY ON OPERATION)'],
    bottom=tt['IRRIGATION STATUS: (HARVESTED CROPLAND IS ENTIRELY IRRIGATED)'] +
           tt['IRRIGATION STATUS: (NONE ON OPERATION)'],
    color='grey'
)
plt.xticks(rotation=70)
plt.ylabel('Irrigation Type (%)', fontsize=14)
plt.xlabel('State', fontsize=14)
plt.title('Amount of Each Irrigation Type as a Percentage in Each State', fontsize=18)
plt.legend(
    ['No Irrigation', 'Entirely Irrigated', 'Any Irrigation on Property'],
    loc='lower left',
    fontsize=14,
    facecolor='white'
)
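
If you're running this as a plain script rather than in a notebook, the figure won't render on its own. Something like the following sketch displays it, or saves it to disk; the filename here is just a hypothetical example:

plt.tight_layout()  # keep the rotated state labels from getting clipped
plt.show()
# or save instead of displaying; hypothetical output path:
# plt.savefig('irrigation_by_state.png', dpi=150, bbox_inches='tight')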

Looking at our data, we can conclude that many of the western states are heavily irrigated. Nevada and California are the most irrigated states, while Iowa and North Dakota are the least irrigated.

This post concludes the series working with the USDA NASS data. At the beginning of the journey, I didn't know what to expect from the data; I chose the topic more or less at random. After digging into this particular dataset, I found some oddities. For example, scattered years like 2013 and 2018 are missing. It also took me a while to write this post because I found what appear to be duplicate data entries. I initially chalked this up to my own error, but after going back and making the data requests again, there really are many rows with the same data. That makes for difficult data wrangling and results that are hard to interpret. My last gripe is the inconsistent capitalization in many of the fields, which is tedious to work with. All that said, no data is perfect, and sometimes you have to make do with what's available, especially if you weren't the one who collected it or defined its conventions.