Introduction to Spatial Vector Data File Formats in Open Source Python
Vector data is one of the two most common spatial data types, representing points, lines, and polygons. You can use vector data for earth data sciencewith the Geopandas package in Python.
Learning Goals:
true
Open a shapefile in Python using Geopandas - gpd.read_file().
Plot a shapfile in Python using Geopandas - gpd.plot().
Authors
Leah Wasser
Nathan Korinek
Elsa Culler
After completing this lesson, you will be able to:
Describe the characteristics of 3 key vector data structures: points, lines and polygons.
Open a shapefile in Python using Geopandas - gpd.read_file().
Plot a shapfile in Python using Geopandas - gpd.plot().
Read More
This lesson is an introduction to working with spatial data. If you wish to dive more deeply into working with spatial data, check out the intermediate earth data science textbook chapter on spatial vector data.
About Spatial Vector Data
Vector data are composed of discrete geometric locations (x, y values) known as vertices that define the “shape” of the spatial object. The organization of the vertices determines the type of vector that you are working with. There are three types of vector data:
Points: Each individual point is defined by a single x, y coordinate. Examples of point data include: sampling locations, the location of individual trees or the location of plots.
Lines: Lines are composed of many (at least 2) vertices, or points, that are connected. For instance, a road or a stream may be represented by a line. This line is composed of a series of segments, each “bend” in the road or stream represents a vertex that has defined x, y location.
Polygons: A polygon consists of 3 or more vertices that are connected and “closed”. Thus, the outlines of plot boundaries, lakes, oceans, and states or countries are often represented by polygons.
There are 3 types of vector objects: points, lines or polygons. Each object type has a different structure. Image Source: Colin Williams (NEON)
Introduction to the Shapefile File Format Which Stores Points, Lines, and Polygons
Geospatial data in vector format are often stored in a shapefile format (.shp). Because the structure of points, lines, and polygons are different, each individual shapefile can only contain one vector type (all points, all lines or all polygons). You will not find a mixture of point, line and polygon objects in a single shapefile.
Objects stored in a shapefile often have a set of associated attributes that describe the data. For example, a line shapefile that contains the locations of streams might also contain the associated stream name, stream “order” and other information about each stream line object.
The shapefile is not the only way that vector data are stored. Geospatial data can also be delivered in a GeoJSON format, or even a tabular format where the spatial information is contained in columns.
What Data Are Stored In Spatial Vector Formats?
Some examples of data that often are provided in a vector format include:
census data including municipal boundaries
roads, powerlines and other infrastructure boundaries
political boundaries
building outlines
water bodies and river systems
ecological boundaries
city locations
object locations including plots, stream gages, and building locations
Import Shapefile Data Into Python Using Geopandas
You will use the geopandas library to work with vector data in Python. Geopandas is built on top of the PythonPandas library. It stores spatial data in a tabular, dataframe format.
To begin, set your working directory to earth-analytics and then download a single shapefile. You will start with working with the Natural Earth country boundary lines layer.
Note that below you are using EarthPy to download a dataset from naturalearthdata.com. EarthPy creates the earth-analytics directory for you when you use it. You set the working directory after you download the data as a precaution to ensure that the earth-analytics directory already exists on your computer. This is not a standard order of operations but we are demonstrating it here to ensure the notebook runs on all computers!
# Import packagesimport osimport matplotlib.pyplot as pltimport geopandas as gpdimport earthpy as et# Download the data and set working directoryboundary_url=("https://naciscdn.org/naturalearth/50m/cultural""/ne_50m_admin_0_boundary_lines_land.zip")et.data.get_data(url=boundary_url)
Downloading from https://naciscdn.org/naturalearth/50m/cultural/ne_50m_admin_0_boundary_lines_land.zip
Extracted output to /home/runner/earth-analytics/data/ne_50m_admin_0_boundary_lines_land
Next, you open the data using Geopandas. You can view the first 5 rows of the data using .head() in the same way you used .head() for Pandas dataframes.
Downloading from https://naciscdn.org/naturalearth/50m/physical/ne_50m_coastline.zip
Extracted output to /home/runner/earth-analytics/data/ne_50m_coastline
ERROR 1: PROJ: proj_create_from_database: Open of /usr/share/miniconda/envs/learning-portal/share/proj failed
scalerank
featurecla
min_zoom
geometry
0
0
Coastline
1.5
LINESTRING (180 -16.15293, 179.84814 -16.21426...
1
0
Coastline
4.0
LINESTRING (177.25752 -17.0542, 177.2874 -17.0...
2
0
Coastline
4.0
LINESTRING (127.37266 0.79131, 127.35381 0.847...
3
0
Coastline
3.0
LINESTRING (-81.32231 24.68506, -81.42007 24.7...
4
0
Coastline
4.0
LINESTRING (-80.79941 24.84629, -80.83887 24.8...
GeoPandas Creates GeoDataFrames Which Have the Same Structure As Pandas DataFrames
The structure of a GeopandasGeoDataFrame is very similar to a Pandas dataframe. A few differences include:
the GeoDataFrame contains a geometry column which stores spatial information. The geometry column in your GeoDataFrame stores the boundary information (the lines that make up each shape in your data). This allows you to plot points, lines or polygons.
theGeoDataFrame stores spatial attributes such as coordinate reference systems and spatial extents.
Similar to Pandas, you can plot the data using .plot()
# Plot the dataf, ax1 = plt.subplots(figsize=(12, 6))coastlines.plot(ax=ax1)# Add a title to your plotax1.set(title="Global Coastline Boundaries")plt.show()
Check the Spatial Vector Data Type
You can look at the data to figure out what type of data are stored in the shapefile (points, line or polygons). However, you can also get that information by calling .geom_type
# Is the geometry type point, line or polygon?coastlines.geom_type
Also similar to Pandas, you can view descriptive information about the GeoDataFrame using .info(). This includes the number of columns, rows and the header name and type of each column.
Next, you will open up another shapefile using Geopandas.
# Open populated placesplaces_url=('https://naciscdn.org/naturalearth/50m/cultural''/ne_50m_populated_places_simple.zip')places_path = et.data.get_data(url=places_url)# Read the populated places shapefilecities = gpd.read_file(places_path)cities.head()
Downloading from https://naciscdn.org/naturalearth/50m/cultural/ne_50m_populated_places_simple.zip
Extracted output to /home/runner/earth-analytics/data/ne_50m_populated_places_simple
scalerank
natscale
labelrank
featurecla
name
namepar
namealt
nameascii
adm0cap
capalt
...
pop_max
pop_min
pop_other
rank_max
rank_min
meganame
ls_name
min_zoom
ne_id
geometry
0
10
1
5
Admin-1 region capital
Bombo
None
None
Bombo
0
0
...
75000
21000
0.0
8
7
None
None
7.0
1159113923
POINT (32.5333 0.5833)
1
10
1
5
Admin-1 region capital
Fort Portal
None
None
Fort Portal
0
0
...
42670
42670
0.0
7
7
None
None
7.0
1159113959
POINT (30.275 0.671)
2
10
1
3
Admin-1 region capital
Potenza
None
None
Potenza
0
0
...
69060
69060
0.0
8
8
None
None
7.0
1159117259
POINT (15.799 40.642)
3
10
1
3
Admin-1 region capital
Campobasso
None
None
Campobasso
0
0
...
50762
50762
0.0
8
8
None
None
7.0
1159117283
POINT (14.656 41.563)
4
10
1
3
Admin-1 region capital
Aosta
None
None
Aosta
0
0
...
34062
34062
0.0
7
7
None
None
7.0
1159117361
POINT (7.315 45.737)
5 rows × 32 columns
Try It
What Geometry Type Are Your Data? Check the geometry type of the cities object that you opened above in your code.
# Add the code here to check the geometry type of the cities objectcities.geom_type
0 Point
1 Point
2 Point
3 Point
4 Point
...
1246 Point
1247 Point
1248 Point
1249 Point
1250 Point
Length: 1251, dtype: object
Creating Maps Using Multiple Shapefiles
You can create maps using multiple shapefiles with Geopandas in a similar way that you may do so using a graphical user interface (GUI) tool like ArcGIS or QGIS (open source alternative to ArcGIS). To do this you will need to open a second spatial file. Below you will use the Natural Earth populated places shapefile to add additional layers to your map.
To plot two datasets together, you will first create a Matplotlib figure object. Notice in the example below that you define the figure ax1 in the first line. You then tell GeoPandas to plot the data on that particular figure using the parameter ax=
To add another layer to your map, you can add a second .plot() call and specify the ax= to be ax1 again. This tells Python to layer the two datasets in the same figure.
# Create a map or plot with two data layersf, ax1 = plt.subplots(figsize=(10, 6))coastlines.plot(ax=ax1, color="black")cities.plot(ax=ax1)# Add a titleax1.set(title="Map of Cities and Global Lines")plt.show()
If you don’t specify the axis when you plot using ax=, each layer will be plotted on a separate figure! See the example below.
The code below will download one additional file for you that contains global country boundaries.
Try It
Your goal is to create a map that contains 3 layers:
the cities or populated places layer that you opened above
the coastlines layer that you opened above and
the countries layer that you will open using the code below
To create a map with the three layers, you can: 1. Copy the code below that downloads the countries layer. 2. Next, use Geopandasread_file() to open the countries layer as a GeoDataFrame. 3. Create a map of all three layers - in the same figure using the same ax=. The countries should be the bottom layer, and the cities and lines should be on top of that layer.
country_data_url = ("https://naciscdn.org/naturalearth""/50m/cultural/ne_50m_admin_0_countries.zip")country_path = et.data.get_data(url=country_data_url)# Load the countries datacountries = gpd.read_file(country_path)
Downloading from https://naciscdn.org/naturalearth/50m/cultural/ne_50m_admin_0_countries.zip
Extracted output to /home/runner/earth-analytics/data/ne_50m_admin_0_countries
Looking for an Extra Challenge?
Customize your map as follows:
Adjust the linewidth of lines with linewidth=4
Adjust the edge color of polygons using: edgecolor="grey"
Adjust the color of your objects (the line color, or point color) using: color='springgreen'.
Finally, add a title to your map using ax1.set(title="my title here")
# Create a map or plot with two data layersf, ax1 = plt.subplots(figsize=(18, 12))coastlines.plot(ax=ax1, color="grey", edgecolor="grey", linewidth=1)countries.plot(ax=ax1, color="white", edgecolor="grey", linewidth=1)cities.plot(ax=ax1, color="springgreen")# Add a titleax1.set(title="Map of Cities, Countries, and Coastlines")plt.show()
Read More
There are many options to customize plots in Python. Below are a few lessons that cover some of this information!
Each object in a shapefile has one or more attributes associated with it. Shapefile attributes are similar to fields or columns in a spreadsheet. Each row in the spreadsheet has a set of columns associated with it that describe the row element. In the case of a shapefile, each row represents a spatial object - for example, a road, represented as a line in a line shapefile, will have one “row” of attributes associated with it.
Each spatial feature in a spatial object has the same set of associated attributes that describe or characterize the feature. Attribute data are stored in a separate *.dbf file. Attribute data can be compared to a spreadsheet. Each row in a spreadsheet represents one feature in the spatial object. Image Source: National Ecological Observatory Network (NEON)
The attributes for a shapefile imported into a GeoDataFrame can be viewed in the GeoDataFrame itself.
# View first 5 rows of GeoDataFramecities.head()
scalerank
natscale
labelrank
featurecla
name
namepar
namealt
nameascii
adm0cap
capalt
...
pop_max
pop_min
pop_other
rank_max
rank_min
meganame
ls_name
min_zoom
ne_id
geometry
0
10
1
5
Admin-1 region capital
Bombo
None
None
Bombo
0
0
...
75000
21000
0.0
8
7
None
None
7.0
1159113923
POINT (32.5333 0.5833)
1
10
1
5
Admin-1 region capital
Fort Portal
None
None
Fort Portal
0
0
...
42670
42670
0.0
7
7
None
None
7.0
1159113959
POINT (30.275 0.671)
2
10
1
3
Admin-1 region capital
Potenza
None
None
Potenza
0
0
...
69060
69060
0.0
8
8
None
None
7.0
1159117259
POINT (15.799 40.642)
3
10
1
3
Admin-1 region capital
Campobasso
None
None
Campobasso
0
0
...
50762
50762
0.0
8
8
None
None
7.0
1159117283
POINT (14.656 41.563)
4
10
1
3
Admin-1 region capital
Aosta
None
None
Aosta
0
0
...
34062
34062
0.0
7
7
None
None
7.0
1159117361
POINT (7.315 45.737)
5 rows × 32 columns
# View data just in the pop_max column of the cities objectcities.pop_max
The spatial and attribute data are not the only important aspects of a shapefile. The metadata of a shapefile are also very important. The metadata includes data on the Coordinate Reference System (CRS), the extent, and much more. For more information on what the metadata is, and how to access it, see the full lesson on vector data on the Earth Lab website, here.
# View the geom_type and the first few rowsprint(cities.geom_type)display(cities.head())f, ax1 = plt.subplots(figsize=(18, 12))cities.plot(ax=ax1, color="purple")
0 Point
1 Point
2 Point
3 Point
4 Point
...
1246 Point
1247 Point
1248 Point
1249 Point
1250 Point
Length: 1251, dtype: object
scalerank
natscale
labelrank
featurecla
name
namepar
namealt
nameascii
adm0cap
capalt
...
pop_max
pop_min
pop_other
rank_max
rank_min
meganame
ls_name
min_zoom
ne_id
geometry
0
10
1
5
Admin-1 region capital
Bombo
None
None
Bombo
0
0
...
75000
21000
0.0
8
7
None
None
7.0
1159113923
POINT (32.5333 0.5833)
1
10
1
5
Admin-1 region capital
Fort Portal
None
None
Fort Portal
0
0
...
42670
42670
0.0
7
7
None
None
7.0
1159113959
POINT (30.275 0.671)
2
10
1
3
Admin-1 region capital
Potenza
None
None
Potenza
0
0
...
69060
69060
0.0
8
8
None
None
7.0
1159117259
POINT (15.799 40.642)
3
10
1
3
Admin-1 region capital
Campobasso
None
None
Campobasso
0
0
...
50762
50762
0.0
8
8
None
None
7.0
1159117283
POINT (14.656 41.563)
4
10
1
3
Admin-1 region capital
Aosta
None
None
Aosta
0
0
...
34062
34062
0.0
7
7
None
None
7.0
1159117361
POINT (7.315 45.737)
5 rows × 32 columns
You can plot your data according to categorical groups similar to what you might do in a tool like ArcGIS or QGIS.
Try It
See what happens when you customize your plot code above.
Optional: Geoprocessing Vector Data Geoprocessing in Python: Clip Data
Sometimes you have spatial data for a larger area than you need to process. For example you may be working on a project for your state or country. But perhaps you have data for the entire globe.
You can clip the data spatially to another boundary to make it smaller. Once the data are clipped, your processing operations will be faster. It will also make creating maps of your study area easier and cleaner.
When you clip a vector data set with another layer, you remove points, lines or polygons that are outside of the spatial extent of the area that you use to clip the data. This images shows a circular clip region - you will be using a rectangular region in this example. Image Source: ESRI
In the bonus challenge below, you will clip the cities point data to the boundary of a single country. This will make the data smaller and easier to use.
you subset the countries layer to just the boundary of the United State of America
you then plot the data to look at the newly subsetted data!
Finally you clip the cities data to only include cities that fall within the boundary of the United States
# Subset the countries data to just a singleunited_states_boundary = ( countries.loc[countries['SOVEREIGNT']=='United States of America'])# Notice in the plot below, that only the boundary for the USA is in the new variablef, ax = plt.subplots(figsize=(10, 6))united_states_boundary.plot(ax=ax)plt.show()
# Clip the cities data to the USA boundary# Note -- this operation may take some time to run - be patientcities_in_usa = gpd.clip(cities, united_states_boundary)# Plot your final clipped dataf, ax = plt.subplots()cities_in_usa.plot(ax=ax)ax.set(title="Cities clipped to the USA Boundary")plt.show()
Looking for an Extra Challenge?
For this bonus challenge, you will clip the cities data that you opened above to the extent of a single country - Canada.
gpd.clip(data_to_clip, boundary_to_clip_to).
Follow the example above to perform your clip operation.
To perform the clip you need the following:
Create the boundary for Canada by subsetting the countries object
Clip the cities layer to the Canada boundary layer
Plot your final cities clipped layer!
When you clip, all of the spatial data outside of the spatial clip boundary (Canada) will not be included in the output dataset.
# Subset the countries data to just a singlecanada_boundary = ( countries.loc[countries['SOVEREIGNT'] =='Canada'])# Below this line clip the cities data to the Canada boundary that you created above.cities_in_canada = gpd.clip(cities, canada_boundary)
# Plot your final clipped data heref, ax = plt.subplots()cities_in_canada.plot(ax=ax)ax.set(title="Cities clipped to the Canada Boundary")plt.show()