
Working with the Google Sheets API in Python

Using Google Sheets to collaborate is a great way to get work accomplished if you and your collaborators:

  1. have different schedules and upload data at different times,
  2. collect different data that needs to be aggregated in the same place,
  3. want to work off the same kind of data, or
  4. don't use a formal database for data collection.

As a data scientist, I would prefer to have data captured in a queryable database, but sometimes that isn't an option, especially if the data comes from a one-off experiment. Luckily, Google Sheets has an API that lets me connect and manipulate data programmatically. This blog post walks through some of these actions using Python.

Before jumping into the code, I'm presuming you're familiar with Google's service account system. To quote Google's cloud documentation, "A service account is a special type of Google account intended to represent a non-human user that needs to authenticate and be authorized to access data in Google APIs." You can read more about it here. If you don't have a service account, you won't be able to access data from a spreadsheet programmatically.

Once your service account is set up, we can get started by installing some packages. First we'll install google-api-python-client, which lets us call Google's APIs. Next we'll install the dataframe library pandas, followed by boto3, AWS's Python client library, which we'll use to pull our secret service account credentials from AWS. The json and os modules come with the standard library and handle data formatting and environment variable lookups, respectively. Finally, we'll install the faker package, which will generate some nice dummy data to experiment on while testing out the Sheets API.
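
If you're starting from a fresh environment, a pip command along these lines should cover the third-party packages (json and os need no install; google-auth supplies the google.oauth2 module used in the imports below):

pip install google-api-python-client google-auth pandas boto3 faker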

from google.oauth2 import service_account
from googleapiclient.discovery import build
import pandas as pd
import boto3
import json
import os
from faker import Faker

Next, we need to create a new spreadsheet to connect to with the block of code below. Type sheets.new into your browser and a new Google Sheets workbook will be created. Be sure to give it a title, and share it with your service account's email address so the account has permission to read and write it. You'll next want to grab the workbook ID from the URL bar; it's the long token between /d/ and /edit.

I placed mine into an environment variable and passed it to my notebook environment.

Alright, let's connect to the Sheets API. We'll need to retrieve our service account credentials; I saved mine in AWS's Secrets Manager service. Using boto3, I can fetch them and pass them to the Sheets service to authorize the connection.

# Pull the service account credentials out of AWS Secrets Manager
sm = boto3.client("secretsmanager")
secret_json = sm.get_secret_value(SecretId="trimble-service-account-google-apis")[
    "SecretString"
]
secret_data = json.loads(secret_json)

# Limit the credentials to the Sheets API
scopes = ["https://www.googleapis.com/auth/spreadsheets"]

creds = service_account.Credentials.from_service_account_info(
    secret_data, scopes=scopes
)

# Build the Sheets service and grab the workbook ID from the environment
service = build("sheets", "v4", credentials=creds)
sheets = service.spreadsheets()
sheet_id = os.environ["SHEET_ID"]
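
Before moving on, a quick smoke test doesn't hurt; fetching the workbook's metadata confirms both the credentials and the ID are good (assuming the workbook has been shared with the service account):

# Fetch the spreadsheet's metadata; a successful call prints the workbook's title
meta = sheets.get(spreadsheetId=sheet_id).execute()
print(meta["properties"]["title"])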

The first thing we're going to do is add data to the spreadsheet. We'll use the faker module to generate some random data. Calling the profile() method creates a "person" with various pieces of information, including a Social Security number, address, name, and so on. It's already formatted nicely for loading into a pandas dataframe. Let's generate the data and take a look at it by calling the head() method on the dataframe.

fake = Faker()
profile_data = [fake.profile() for _ in range(100)]
df = pd.DataFrame(profile_data)
df.head()
job company ssn residence current_location blood_group website username name sex address mail birthdate
0 Lecturer, higher education Hart PLC 164-04-3123 13655 Ashley Unions Suite 616\nLoveborough, ID... (53.0466615, -82.546387) A+ [http://www.howell-sharp.net/, https://www.rob... youngzachary Cassandra Alvarado F 63480 Liu Run\nNorth Anne, OR 10500 crystal50@yahoo.com 1972-12-03
1 Records manager Walker, Patton and George 180-92-5579 21263 Pamela Extensions\nVillamouth, KS 96993 (-78.0920215, -165.065549) O- [https://campbell-reed.com/] nelsonsamantha Michael Blake M 67091 Adam Wall Suite 530\nNorth Lisa, AK 93355 asmith@hotmail.com 1937-08-24
2 Visual merchandiser Kelley-Simmons 475-02-6595 5506 Riggs Junctions\nTaylorville, PA 48609 (10.722590, -41.148270) A+ [https://www.brown.net/, http://www.davis.com/... jefferysmith Richard Jones M 72826 Norman Grove\nAlantown, MD 17603 imelendez@yahoo.com 1993-08-01
3 Mining engineer Warren, Walsh and Myers 894-20-0037 31788 Crane Extension\nGarciafort, MS 74611 (25.205013, 133.580372) A+ [https://www.higgins.info/, https://cline.com/] josephbell Jessica Velez F 7032 Devin Groves Apt. 845\nNew Emilyville, UT... brandonbright@yahoo.com 1911-03-21
4 Event organiser Peterson-Johnson 609-50-0924 97149 Shannon Divide\nNorth Williamberg, UT 09107 (82.5767035, -111.439504) O- [https://www.smith.com/] markadams Jennifer Hernandez F Unit 1455 Box 1494\nDPO AA 37368 wpeterson@hotmail.com 1965-10-06

Let's select a subset of the data to get a more manageable set. Below I pick four columns I'll use to demonstrate uploading to Google Sheets.

upload_df = df[['job', 'company', 'blood_group', 'username']]
upload_df.head()
job company blood_group username
0 Lecturer, higher education Hart PLC A+ youngzachary
1 Records manager Walker, Patton and George O- nelsonsamantha
2 Visual merchandiser Kelley-Simmons A+ jefferysmith
3 Mining engineer Warren, Walsh and Myers A+ josephbell
4 Event organiser Peterson-Johnson O- markadams

To start uploading values to Sheets, we need a list of lists: each row is a list of values, and the rows are combined into an outer list for uploading. We then have some parameters to pass to the API. These include range, which accepts either the name of the sheet you're writing to or a range of cells like A1:B10; for our case, I'll use the sheet name, Sheet1. Another parameter is value_input_option, which tells Sheets either to parse the values as if a user had typed them in or to store them exactly as provided. I want the data represented as is, so I'll pass RAW.

# The first row is the header, followed by the data rows
values = [upload_df.columns.values.tolist()]
values.extend(upload_df.values.tolist())
data = [
    {"range": "Sheet1",
     "values": values}
]
batch_update = {
    "value_input_option": "RAW",
    "data": data
}

# Write the values into the workbook
response = (
    sheets
    .values()
    .batchUpdate(
        spreadsheetId=sheet_id,
        body=batch_update
    )
    .execute()
)
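
The call returns a small summary as well; if my read of the response shape is right, a quick sanity check might look like this:

# 100 data rows plus a header row across 4 columns should report 404 updated cells
print(response.get("totalUpdatedCells"))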

If you navigate back to the workbook in your browser, you'll see that the data has made its way into the sheet just as it appeared in your dataframe. Next, we'll extract the data we just uploaded to demonstrate getting information out of a sheet.

We'll first need to describe the range of cells from which we'll be pulling data. Google Sheets uses A1 notation: you provide the sheet name, followed by an !, and then a range of cells. Below I use Sheet1!A1:E101; the 101 rows account for the column headers plus the 100 rows of data. This will extract every value in columns A - E and rows 1 - 101.

_r = "Sheet1!A1:E101"

Because the data in the response comes back as a list of lists, we can pass it directly to pandas' pd.DataFrame() constructor.

# Pull the values back down and load them straight into a dataframe
new_data = pd.DataFrame(
    sheets.values()
    .get(spreadsheetId=sheet_id, range=_r, majorDimension="ROWS")
    .execute()["values"]
)
new_data.head()
0 1 2 3 4
0 job company blood_group username id_number
1 Lecturer, higher education Hart PLC A+ youngzachary 1
2 Records manager Walker, Patton and George O- nelsonsamantha 2
3 Visual merchandiser Kelley-Simmons A+ jefferysmith 3
4 Mining engineer Warren, Walsh and Myers A+ josephbell 4

The column headers come back as ordinary values in the first row rather than as column names. We can set that row as the column names; the code below accomplishes this for us, and we can finally view the data in a workable fashion.

new_data.columns = new_data.iloc[0]
new_data.drop(index=[0], inplace=True)
new_data.head()
job company blood_group username id_number
1 Lecturer, higher education Hart PLC A+ youngzachary 1
2 Records manager Walker, Patton and George O- nelsonsamantha 2
3 Visual merchandiser Kelley-Simmons A+ jefferysmith 3
4 Mining engineer Warren, Walsh and Myers A+ josephbell 4
5 Event organiser Peterson-Johnson O- markadams 5
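
One last thing worth noting: in my experience the values come back as text by default, so a numeric column like id_number usually needs converting before you can do math on it; something along these lines:

# Sheets returns cell values as strings; cast the numeric column back to numbers
new_data["id_number"] = pd.to_numeric(new_data["id_number"])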

Working with the Google Sheets API can be difficult at times because the documentation isn't the best. However, once you get the hang of it, it is extremely useful for collaborating with colleagues who are less data- or code-fluent.