Using Google Sheets to collaborate is a great way to get work accomplished if you and your collaborator(s):
- have different schedules and upload data at different times,
- collect different data that needs to be aggregated in the same place,
- want to work from the same data, or
- don't use a formal database for data collection.
As an active data scientist, I would prefer to have data captured in a queryable database, but sometimes that isn't an option, especially when the data comes from a one-off experiment. Luckily, Google Sheets has an API to which I can connect and programmatically manipulate data. This blog post will walk through some of these actions using Python.
Before jumping into the code, I'm presuming you have some familiarity with Google's service account system. To quote from Google's cloud documentation, "A service account is a special type of Google account intended to represent a non-human user that needs to authenticate and be authorized to access data in Google APIs." You can read more about it here. If you don't have a service account, you won't be able to access data from a spreadsheet.
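When you create a service account, Google hands you a JSON key file. As a rough sketch (all values below are placeholders), its shape looks like this:

# Rough shape of a service account key file (placeholder values):
service_account_info = {
    "type": "service_account",
    "project_id": "your-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "sheets-bot@your-project.iam.gserviceaccount.com",
    # ...plus private_key_id, client_id, auth_uri, token_uri, and so on
}

This is exactly the structure we'll hand to the credentials loader later, after pulling it out of AWS.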
Once your service account is set up, we can get started by installing some packages. First we'll install the google-api-python-client package, which lets us call Google's APIs from Python. Next, we'll install the dataframe manipulation package, pandas, followed by boto3, AWS's Python client library, which we'll use to fetch our secret account credentials from AWS. The json and os modules, used for data formatting and reading environment variables, ship with Python's standard library, so there's nothing to install for those. Finally, we'll install the faker package, which will generate some nice dummy data to experiment on while testing out the Sheets API.
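If you don't already have the third-party packages, a pip command along these lines should pull them in:

pip install google-api-python-client pandas boto3 faker

With everything installed, here are the imports we'll need.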
from google.oauth2 import service_account
from googleapiclient.discovery import build
import pandas as pd
import boto3
import json
import os
from faker import Faker
Next, we need to create a new spreadsheet to which we'll connect using the block of code below. Type sheets.new into your browser and a new Google Sheets workbook will be created. Be sure to give it a title, and share the workbook with your service account's email address so it's authorized to read and write the data. You'll next want to grab the workbook ID from the URL bar; it's the long string between /d/ and /edit.
I placed mine into an environment variable and passed it to my notebook environment.
Alright, let's connect to the Sheets API. We'll need to retrieve our service account credentials; I saved mine in AWS's Secrets Manager service. Using boto3, I can pass these credentials to the Sheets service, authorizing the connection.
# Pull the service account credentials out of AWS Secrets Manager
sm = boto3.client("secretsmanager")
secret_json = sm.get_secret_value(SecretId="trimble-service-account-google-apis")[
    "SecretString"
]
secret_data = json.loads(secret_json)

# Build an authorized Sheets client with read/write scope
scopes = ["https://www.googleapis.com/auth/spreadsheets"]
creds = service_account.Credentials.from_service_account_info(
    secret_data, scopes=scopes
)
service = build("sheets", "v4", credentials=creds)
sheets = service.spreadsheets()

# The workbook ID grabbed from the URL bar earlier
sheet_id = os.environ["SHEET_ID"]
The first thing we're going to do is add data to the spreadsheet. We'll use the faker module to generate some random data. We can call the profile() method, which creates a "person" with various information including a Social Security number, address, name, etc. It's already formatted correctly for incorporation into a pandas dataframe. Let's generate the data and visualize it by calling the head() method on the dataframe.
fake = Faker()
profile_data = [fake.profile() for _ in range(100)]
df = pd.DataFrame(profile_data)
df.head()
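If you're curious which fields profile() generates, you can peek at the dictionary keys; the columns we select below (job, company, blood_group, username) are all in there:

print(sorted(fake.profile().keys()))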
Let's select a subset of the data to get a manageable set. Below I selected four columns I'll use to demonstrate uploading to Google Sheets.
upload_df = df[['job', 'company', 'blood_group', 'username']]
upload_df.head()
To start uploading values to Sheets, we need to create a list of lists: each row of data is a list, and we combine the rows into an outer list for uploading. We then have some parameters to pass to the API. These include range, which accepts either the name of the sheet receiving the data or a range of cells like A1:B10; for our case, I'll use the sheet name, Sheet1. Another parameter is value_input_option, which tells Sheets either to parse the values as a user would have entered them or to store the information as is. I want it represented as is, so I'll pass RAW as the input.
values = [upload_df.columns.values.tolist()]  # header row
values.extend(upload_df.values.tolist())      # one list per row of data

data = [
    {"range": "Sheet1", "values": values}
]
batch_update = {
    "value_input_option": "RAW",
    "data": data,
}
response = (
    sheets.values()
    .batchUpdate(spreadsheetId=sheet_id, body=batch_update)
    .execute()
)
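For contrast, here's a minimal sketch (writing to a hypothetical target cell) of the other input option, USER_ENTERED, which asks Sheets to parse values as though a user typed them, so date-like strings become dates and strings starting with = become live formulas:

# Hypothetical example: USER_ENTERED makes Sheets parse the value,
# so this string is stored as a working formula rather than text.
sheets.values().update(
    spreadsheetId=sheet_id,
    range="Sheet1!F1",
    valueInputOption="USER_ENTERED",
    body={"values": [["=COUNTA(A2:A101)"]]},
).execute()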
If you navigate back to the workbook in your browser, you'll see that the data has made its way into the sheet just as it was in your dataframe. Next, we'll extract the data we just uploaded to demonstrate getting information from a sheet.
We'll first need to describe the range of cells from which we'll be pulling data. Google Sheets uses A1 notation: you first provide the sheet name, followed by an ! and then a range of cells. Below I have Sheet1!A1:E101, where the extra row compensates for the column headers. This will extract every value in columns A through E and rows 1 through 101.
_r = "Sheet1!A1:E101"
Because the data in the response comes back as a list of lists, we can pass it directly to pandas' pd.DataFrame() constructor.
# Pull the values back down and load them straight into a dataframe
new_data = pd.DataFrame(
    sheets.values()
    .get(spreadsheetId=sheet_id, range=_r, majorDimension="ROWS")
    .execute()["values"]
)
new_data.head()
The column headers come back as ordinary data values and form the first row of the dataframe. We can promote that row to the column names; the code below accomplishes this for us. Finally, we can visualize the data in a workable fashion.
new_data.columns = new_data.iloc[0]
new_data.drop(index=[0], inplace=True)
new_data.head()
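Alternatively, you can split the header off before constructing the dataframe, which skips the drop-and-rename step:

# Same retrieval, but promote the header row while building the dataframe
raw = (
    sheets.values()
    .get(spreadsheetId=sheet_id, range=_r, majorDimension="ROWS")
    .execute()["values"]
)
new_data = pd.DataFrame(raw[1:], columns=raw[0])
new_data.head()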
Working with the Google Sheets API can be difficult at times because the documentation isn't the best. However, once you get it down, it is extremely useful for collaborating with colleagues who are not as fluent in data or code.