id = 'stars'
project_dirname = 'tasiyagnunpa-migration-2023'
species_name = 'Tasiyagnunpa'
species_lookup = 'sturnella neglecta'
sample_filename = 'migration-stars-data'
gbif_filename = 'gbif_tasiyagnunpa.csv'
plot_filename = 'tasiyagnunpa_migration'
plot_height = 500
Migration Data Download
Get Tasiyagnunpa occurrence data from the Global Biodiversity Information Facility (GBIF)
Before we get started, let’s define some parameters for the workflow. We’ll use these throughout to customize the workflow for this species:
Access locations and times of Tasiyagnunpa encounters
For this challenge, you will use a database called the Global Biodiversity Information Facility (GBIF). GBIF is compiled from species observation data all over the world, and includes everything from museum specimens to photos taken by citizen scientists in their backyards.
Before you get started, go to the GBIF occurrences search page and explore the data.
You can get your own observations added to GBIF using iNaturalist!
Set up your code to prepare for download
We will be getting data from a source called GBIF (Global Biodiversity Information Facility). We need a package called pygbif to access the data, which may not be included in your environment. Install it by running the cell below:
%%bash
pip install pygbif
In the imports cell, we’ve included some packages that you will need. Add imports for packages that will help you:
- Work with reproducible file paths
- Work with tabular data
import time
import zipfile
from getpass import getpass
from glob import glob
import pygbif.occurrences as occ
import pygbif.species as species
See our solution!
import os
import pathlib
import shutil
import time
import zipfile
from getpass import getpass
from glob import glob
import earthpy
import pandas as pd
import pygbif.occurrences as occ
import pygbif.species as species
Create a directory for your data
For this challenge, you will need to download some data to the computer you’re working on. We suggest using the earthpy library we develop to manage your downloads, since it encapsulates many best practices for:
- Where to store your data
- Dealing with archived data like .zip files
- Avoiding version control problems
- Making sure your code works cross-platform
- Avoiding duplicate downloads
If you’re working on one of our assignments through GitHub Classroom, it also lets us build in some handy defaults so that you can see your data files while you work.
The code below will help you get started with making a project directory.
- Replace 'your-project-directory-name-here' with a descriptive name
- Run the cell
- The code should have printed out the path to your data files. Check that your data directory exists and has data in it using the terminal or your Finder/File Explorer.
These days, a lot of people find their files by searching for them or selecting from a Bookmarks or Recents list. Even if you don’t use it, your computer also keeps files in a tree structure of folders. Put another way, you can organize and find files by travelling along a unique path, e.g. My Drive > Documents > My awesome project > A project file, where each subsequent folder is inside the previous one. This is convenient because all the files for a project can be in the same place, and both people and computers can rapidly locate files they want, provided they remember the path.
You may notice that when Python prints out a file path like this, the folder names are separated by a / or \ (depending on your operating system). This character is called the file separator, and it tells you that the next piece of the path is inside the previous one.
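Want to see the file separator in action? Here’s a minimal sketch using Python’s built-in pathlib library, which inserts the correct separator for your operating system (the folder names are just the example from above):

import pathlib

# Build a path with pathlib so Python inserts the correct file separator
# for your operating system ('/' on macOS and Linux, '\' on Windows).
# 'My awesome project' is the hypothetical folder from the example above.
example_path = pathlib.Path.home() / 'Documents' / 'My awesome project'
print(example_path)        # the full path, with your system's separator
print(example_path.parts)  # each folder along the path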
# Create data directory
project = earthpy.Project(
    dirname='your-project-directory-name-here')

# Download sample data
project.get_data()

# Display the project directory
project.project_dir
See our solution!
# Create data directory
project = earthpy.Project(dirname=project_dirname)

# Download sample data
project.get_data()

# Display the project directory
project.project_dir
**Final Configuration Loaded:**
{}
PosixPath('/home/runner/.local/share/earth-analytics/tasiyagnunpa-migration-2023')
Register and log in to GBIF
You will need a GBIF account to complete this challenge. You can use your GitHub account to authenticate with GBIF. Then, run the following code to enter your credentials for the rest of your session.
This code is interactive, meaning that it will ask you for a response! The prompt can sometimes be hard to see if you are using VSCode – it appears at the top of your editor window.
If you need to save credentials across multiple sessions, you can consider loading them in from a file like a .env… but make sure to add it to .gitignore so you don’t commit your credentials to your repository!
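For example, here is a minimal sketch using the python-dotenv package (an assumption on our part; it is not installed by this workflow), which reads KEY=value pairs from a .env file into os.environ:

# A sketch, assuming you have run `pip install python-dotenv` and created
# a .env file (listed in .gitignore!) with lines like:
#   GBIF_USER=your-username
#   GBIF_PWD=your-password
#   GBIF_EMAIL=you@example.com
from dotenv import load_dotenv

load_dotenv()  # reads .env and adds its variables to os.environ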
Your email address must match the email you used to sign up for GBIF!
If you accidentally enter your credentials wrong, you can set reset=True instead of reset=False.
####--------------------------####
#### DO NOT MODIFY THIS CODE! ####
####--------------------------####
# This code ASKS for your credentials and saves them for the rest of the session.
# NEVER put your credentials into your code!!!!

# GBIF needs a username, password, and email -- all need to match the account
reset = False

# Request and store username
if (not ('GBIF_USER' in os.environ)) or reset:
    os.environ['GBIF_USER'] = input('GBIF username:')

# Securely request and store password
if (not ('GBIF_PWD' in os.environ)) or reset:
    os.environ['GBIF_PWD'] = getpass('GBIF password:')

# Request and store account email address
if (not ('GBIF_EMAIL' in os.environ)) or reset:
    os.environ['GBIF_EMAIL'] = input('GBIF email:')
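If you want to double-check that all three credentials made it into your environment, here’s an optional sketch that avoids printing the password itself:

# Optional sanity check: confirm each credential is set without echoing it
for variable in ['GBIF_USER', 'GBIF_PWD', 'GBIF_EMAIL']:
    print(variable, 'is set:', variable in os.environ)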
Get the species key
- Replace the species_name with the name of the species you want to look up
- Run the code to get the species key
# Query species
species_info = species.name_lookup(species_name, rank='SPECIES')

# Get the first result
first_result = species_info['results'][0]

# Get the species key (speciesKey)
species_key = first_result['speciesKey']

# Check the result
first_result['species'], species_key
See our solution!
# Query species
species_info = species.name_lookup("Sturnella neglecta Audubon, 1844", rank='SPECIES')

# Get the first result
first_result = species_info['results'][0]

# Get the species key (speciesKey)
species_key = first_result['speciesKey']

# Check the result
first_result['species'], species_key
('Sturnella neglecta', 159147257)
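Name lookups can return more than one match, so it’s worth scanning the top few results before trusting the first one. A quick, optional sketch:

# Optional: scan the top matches to confirm results[0] is the taxon you want
for result in species_info['results'][:5]:
    print(result.get('scientificName'), result.get('speciesKey'))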
Download data from GBIF
- Replace csv_file_pattern with a string that will match any .csv file when used in the glob function. HINT: the character * represents any number of any values except the file separator (e.g. /) (see the sketch just after this list).
- Add parameters to the GBIF download function, occ.download(), to limit your query to:
  - observations of Tasiyagnunpa
  - from 2023
  - with spatial coordinates
- Then, run the download. This can take a few minutes.
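To build intuition for the * wildcard, here’s a small sketch you can run once your project directory has files in it:

# List every .csv file directly inside the project folder. A file like
# 'occurrence.csv' (hypothetical) would match '*.csv'; a file inside a
# subfolder would not, because * stops at the file separator.
glob(str(project.project_dir / '*.csv'))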
# Only download once
if not glob(str(project.project_dir / csv_file_pattern)):
    # Only submit one request
    if 'GBIF_DOWNLOAD_KEY' not in os.environ:
        # Submit query to GBIF
        gbif_query = occ.download([
            "speciesKey = ",
            "year = ",
            "hasCoordinate = ",
        ])
        os.environ['GBIF_DOWNLOAD_KEY'] = gbif_query[0]

    # Wait for the download to build
    download_key = os.environ['GBIF_DOWNLOAD_KEY']
    wait = occ.download_meta(download_key)['status']
    while wait != 'SUCCEEDED':
        wait = occ.download_meta(download_key)['status']
        time.sleep(5)

    # Download GBIF data
    download_info = occ.download_get(
        os.environ['GBIF_DOWNLOAD_KEY'],
        path=project.project_dir)

    # Unzip GBIF data
    with zipfile.ZipFile(download_info['path']) as download_zip:
        download_zip.extractall(path=project.project_dir)

# Find the extracted .csv file path (take the first result)
original_gbif_path = glob(str(project.project_dir / csv_file_pattern))[0]
original_gbif_path
See our solution!
# Only download once
if not glob(str(project.project_dir / '*.csv')):
    # Only submit one request
    if 'GBIF_DOWNLOAD_KEY' not in os.environ:
        # Submit query to GBIF
        gbif_query = occ.download([
            f"speciesKey = {species_key}",
            "hasCoordinate = TRUE",
            "year = 2023",
        ])
        os.environ['GBIF_DOWNLOAD_KEY'] = gbif_query[0]

    # Wait for the download to build
    download_key = os.environ['GBIF_DOWNLOAD_KEY']
    wait = occ.download_meta(download_key)['status']
    while wait != 'SUCCEEDED':
        wait = occ.download_meta(download_key)['status']
        time.sleep(5)

    # Download GBIF data
    download_info = occ.download_get(
        os.environ['GBIF_DOWNLOAD_KEY'],
        path=project.project_dir)

    # Unzip GBIF data
    with zipfile.ZipFile(download_info['path']) as download_zip:
        download_zip.extractall(path=project.project_dir)

    # Clean up the .zip file (os.remove, since it is a file, not a directory)
    os.remove(download_info['path'])

# Find the extracted .csv file path (take the first result)
original_gbif_path = glob(str(project.project_dir / '*.csv'))[0]
original_gbif_path
'/home/runner/.local/share/earth-analytics/tasiyagnunpa-migration-2023/gbif_tasiyagnunpa.csv'
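Note that the code above caches the request key in the GBIF_DOWNLOAD_KEY environment variable so that you only submit one request per session. If you change the query and need a genuinely fresh download, here is a minimal sketch of a reset (you would also need to delete the old .csv file):

# Optional: clear the cached request key so the next run submits a new query
os.environ.pop('GBIF_DOWNLOAD_KEY', None)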
You might notice that the GBIF data filename isn’t very descriptive…at this point, you may want to clean up your data directory so that you know what the file is later on!
- Replace ‘your-gbif-filename-here’ with a descriptive name.
- Run the cell
- Check your data folder. Is it organized the way you want?
# Give the download a descriptive name
gbif_path = project.project_dir / 'your-gbif-filename-here'
shutil.move(original_gbif_path, gbif_path)

# Clean up the .zip file (os.remove, since it is a file, not a directory)
os.remove(download_info['path'])
See our solution!
# Give the download a descriptive name
gbif_path = project.project_dir / gbif_filename
shutil.move(original_gbif_path, gbif_path)
PosixPath('/home/runner/.local/share/earth-analytics/tasiyagnunpa-migration-2023/gbif_tasiyagnunpa.csv')
Load the GBIF data into Python
- Look at the beginning of the file you downloaded using the code below. What do you think the delimiter is?
- Run the following code cell. What happens?
- Uncomment and modify the parameters of pd.read_csv() below until your data loads successfully and you have only the columns you want.
You can use the following code to look at the beginning of your file:
!head -n 2 $gbif_path
gbifID datasetKey occurrenceID kingdom phylum class order family genus species infraspecificEpithet taxonRank scientificName verbatimScientificName verbatimScientificNameAuthorship countryCode locality stateProvince occurrenceStatus individualCount publishingOrgKey decimalLatitude decimalLongitude coordinateUncertaintyInMeters coordinatePrecision elevation elevationAccuracy depth depthAccuracy eventDate day month year taxonKey speciesKey basisOfRecord institutionCode collectionCode catalogNumber recordNumber identifiedBy dateIdentified license rightsHolder recordedBy typeStatus establishmentMeans lastInterpreted mediaType issue
4501319588 2f54cb88-4167-499a-81fb-0a2d02465212 http://arctos.database.museum/guid/DMNS:Bird:57539?seid=6172480 Animalia Chordata Aves Passeriformes Icteridae Sturnella Sturnella neglecta SPECIES Sturnella neglecta Audubon, 1844 Sturnella neglecta US Fort Collins, 6888 East County Road 56 Colorado PRESENT a2ef6dd1-8886-48c9-8025-c62bac973cc7 40.657779 -104.94913 80.0 1609.0 0.0 2023-05-30 30 5 2023 9596413 9596413 PRESERVED_SPECIMEN DMNS Bird DMNS:Bird:57539 Greenwood Wildlife Rehabilitation Center 2023-05-30T00:00:00 CC_BY_NC_4_0 Collector(s): Greenwood Wildlife Rehabilitation Center 2025-02-06T17:47:06.161Z COORDINATE_ROUNDED;CONTINENT_DERIVED_FROM_COORDINATES;INSTITUTION_MATCH_FUZZY;COLLECTION_MATCH_FUZZY
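The ! syntax runs a shell command, which may not be available on every system (for example, head is missing from the default Windows shell). Here’s a pure-Python sketch that does the same thing:

# Print the first two lines of the file without relying on shell tools
with open(gbif_path) as gbif_file:
    for _ in range(2):
        print(gbif_file.readline())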
# Load the GBIF data
gbif_df = pd.read_csv(
    gbif_path,
    # delimiter='',
    # index_col='',
    # usecols=[]
)
gbif_df.head()
See our solution!
# Load the GBIF data
gbif_df = pd.read_csv(
    gbif_path,
    delimiter='\t',
    index_col='gbifID',
    usecols=['gbifID', 'decimalLatitude', 'decimalLongitude', 'month'])
gbif_df.head()
| gbifID | decimalLatitude | decimalLongitude | month |
|---|---|---|---|
| 4501319588 | 40.657779 | -104.949130 | 5 |
| 4501319649 | 40.266835 | -105.163977 | 7 |
| 4697139297 | 31.569170 | -109.700950 | 2 |
| 4735897257 | 40.582947 | -102.277350 | 4 |
| 4719794206 | 39.266953 | -104.515920 | 6 |
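As a quick check that your download covers the whole year (and a preview of the migration analysis to come), you can count the observations in each month. An optional sketch:

# Count occurrences in each month (1 = January ... 12 = December)
gbif_df.groupby('month').size()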