Kaggle is a great website for finding datasets to help expand your skills and expertise in data science. In this short post, we'll walk through a quick set-up of Kaggle at the command line and then pull one of the many datasets available. A follow-up post will use the R programming language to build a plot with ggplot2.
Before getting started, be sure to set up an account on Kaggle. You'll need it to request an API key. After setting up an account, install the Kaggle package using pip, Python's package manager. Run
pip install kaggle
to download the Kaggle package locally. By default, the Kaggle command-line tool reads its credentials from a hidden .kaggle/ folder in your home directory. If that folder isn't there after installing, create it yourself.
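If you'd like to confirm the install worked before moving on, a quick check from Python can help. This is only a sketch: kaggle_install_status is a made-up helper name, and it reads the installed package metadata instead of importing kaggle, since importing the package may try to authenticate and fail before an API token is in place.

```python
import shutil
from importlib import metadata


def kaggle_install_status(package="kaggle", command="kaggle"):
    """Return (version, cli_path); either value is None when it cannot be found."""
    try:
        # Reads the installed distribution's metadata without importing it.
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    # shutil.which returns None when the command is not on PATH.
    cli_path = shutil.which(command)
    return version, cli_path


version, cli_path = kaggle_install_status()
print("kaggle package version:", version)
print("kaggle command found at:", cli_path)
```

If either value prints as None, the package or its command-line entry point isn't visible from the Python environment you ran this in.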
Next, we'll navigate to our account settings on Kaggle. Select the Create New API Token button to download a kaggle.json file, which you'll then need to store in the hidden .kaggle/ folder in your home directory. This will allow us to access datasets programmatically.
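Moving the token into place can be scripted as well. The sketch below is one way to do it, not the official procedure: install_kaggle_token is a hypothetical helper, and the Downloads location in the commented usage line is an assumption about where your browser saved the file. It also restricts the file to owner-only access on macOS and Linux, since the Kaggle tool warns when the key is readable by other users.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(downloaded_token, home=None):
    """Copy kaggle.json into <home>/.kaggle/ and make it readable by the owner only."""
    home = Path(home) if home is not None else Path.home()
    config_dir = home / ".kaggle"
    # Create the hidden folder if it does not exist yet.
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "kaggle.json"
    shutil.copy(downloaded_token, target)
    # Equivalent to chmod 600: read/write for the owner, nothing for anyone else.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target


# Example usage (the Downloads path is an assumption, adjust it to your machine):
# install_kaggle_token(Path.home() / "Downloads" / "kaggle.json")
```

Copying rather than moving leaves the original download untouched, so you can delete it yourself once everything works.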
Now we're ready to download some data. For the visualizations in the next post, I'd like to use the dataset provided by raghavramasamy, who compiled world-wide statistics on many different crops from the Food and Agriculture Organization (FAO). This is a nice dataset because it's already in tidy format. To download this dataset, run:
kaggle datasets download -d raghavramasamy/crop-statistics-fao-all-countries
This downloads a zip archive to your current working directory. Once this is complete, we're ready to use the data.
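The download typically arrives as a zip archive, so it needs unpacking before R can read it. Here is a minimal sketch using Python's standard zipfile module. extract_dataset is a hypothetical helper, and the archive name in the commented usage line is an assumption based on the dataset's name, so check what actually landed in your directory.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive, dest):
    """Unpack a downloaded dataset archive into dest and return the member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        # Sorted so the listing is stable from run to run.
        return sorted(zf.namelist())


# Example usage (the archive name is an assumption, not confirmed by the post):
# for name in extract_dataset("crop-statistics-fao-all-countries.zip", "data"):
#     print(name)
```

Printing the member names is a handy way to see which files the dataset ships with before loading any of them.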