Unlocking the Future of Geospatial Insights
In a world driven by data, where location is the linchpin connecting disparate information streams, the synergy between Databricks, Apache Spark, and Esri's ArcGIS GeoAnalytics Engine emerges as a formidable force, a triumvirate reshaping the very landscape of geospatial analytics. The era of data-driven decision-making has dawned upon us, and at its epicenter lies the power to harness location intelligence for unparalleled insights.
Industries spanning from urban planning and logistics to healthcare and environmental conservation now stand at the threshold of transformative change, where the convergence of cutting-edge technology promises not only efficiency but visionary innovation. Databricks, Spark, and Geoanalytics Engine unite, setting forth a new frontier in geospatial analysis—a fusion that is nothing short of a geospatial dream team.
The Power of Databricks and Spark
Through this article, we'll show how Databricks, powered by Spark, can reduce the complexity of geospatial data and conquer the challenges that come with it.
For those unfamiliar with Databricks, its own definition says it best:
“Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale.”
Spark's distributed data processing capabilities are exemplified by its ability to parallelize data across clusters, allowing for high-speed data processing in a distributed fashion. It achieves this through its Resilient Distributed Dataset (RDD) abstraction and optimized query execution, making it a versatile and efficient choice for large-scale data processing tasks. Spark and Databricks offer some geospatial query options out of the box, but for our purposes we are going to expand on these by integrating a newer tool from Esri: GeoAnalytics Engine.
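To make that parallelism concrete, here is a minimal, self-contained PySpark sketch; the tiny inline dataset and column names are illustrative only. Spark splits the DataFrame's rows into partitions, and the filter and aggregation below run on each partition in parallel before the results are combined.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on Databricks this already exists as `spark`
spark = SparkSession.builder.appName("parallelism-sketch").getOrCreate()

# A hypothetical toy dataset; any large source (CSV, Parquet, JDBC) works the same way
df = spark.createDataFrame(
    [(1, "sensor-a", 12.4), (2, "sensor-b", 98.1), (3, "sensor-a", 55.0)],
    ["id", "sensor", "reading"],
)

# Each partition is filtered and pre-aggregated locally, then the results are combined
df.where("reading > 50").groupBy("sensor").avg("reading").show()
print(f"Partitions: {df.rdd.getNumPartitions()}")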
Esri’s GeoAnalytics Engine
ArcGIS GeoAnalytics Engine extends these capabilities with easy-to-use APIs that accelerate the process and integrate your ArcGIS platform further into the Databricks unified platform. This gives users the scalability and benefits of Databricks while seamlessly integrating to and from feature services within Esri's ArcGIS platform.
It also provides support for traditional files common in GIS applications, such as CSV and shapefiles. Additionally, it supports other sources, such as Apache Parquet and its geospatially enabled variant, GeoParquet, both of which provide massive benefits in terms of storage, querying, and read-write speed.
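As a quick sketch of what reading these formats can look like, the snippet below loads a shapefile, a GeoParquet dataset, and a plain CSV. The paths and column names are hypothetical, and the data source names ("shapefile", "geoparquet") are assumptions to verify against the GAE documentation for your version.

from pyspark.sql.functions import col
from geoanalytics.sql import functions as ST

# Shapefile via the GAE shapefile data source (hypothetical path)
parcels = spark.read.format("shapefile").load("/mnt/data/parcels")

# GeoParquet via the GAE geoparquet data source (hypothetical path)
trips = spark.read.format("geoparquet").load("/mnt/data/trips.geoparquet")

# Plain CSV with lon/lat columns, converted to point geometries in WGS 84
sites = (spark.read.option("header", True).csv("/mnt/data/sites.csv")
         .withColumn("geometry",
                     ST.point(col("lon").cast("double"),
                              col("lat").cast("double"),
                              sr=4326)))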
Benefits of Integration
On top of data integration, GeoAnalytics Engine (GAE) provides two other primary modules: the SQL module (geoanalytics.sql.functions) and the Tools module (geoanalytics.tools). Users of other ArcGIS APIs and tools will find many of the provided tools familiar, such as SummarizeWithin, a common statistical aggregation method in GIS, now with added optimization from the Spark engine on Databricks.
For example, here is what SummarizeWithin looks like in GAE (the same tool familiar from ArcGIS Pro):
from geoanalytics.tools import SummarizeWithin

# Summarize a spatially enabled DataFrame (rt_result) into 200 km hexagonal bins
result = SummarizeWithin() \
    .setSummaryBins(bin_size=200, bin_size_unit="Kilometers", bin_type="hexagon") \
    .includeShapeSummary(include=True, units="Kilometers") \
    .run(dataframe=rt_result)
Use Cases
Urban Planning and Development:
- Analyzing geospatial data to optimize urban planning, including land use, transportation networks, and infrastructure development
- Predicting population growth patterns to guide city expansion and resource allocation
Logistics and Supply Chain Optimization:
- Real-time tracking and optimization of delivery routes for logistics companies
- Inventory management and demand forecasting considering spatial factors
Natural Resource Management:
- Monitoring and managing natural resources, such as forests, water bodies, and agricultural land
- Predicting wildfires, analyzing deforestation, and ensuring sustainable resource usage
Environmental Monitoring:
- Analyzing environmental data to monitor air and water quality, climate change, and biodiversity
- Identifying pollution sources and assessing their impact
Emergency Response and Disaster Management:
- Real-time mapping and visualization of natural disasters like hurricanes, earthquakes, and floods
- Predictive modeling for disaster preparedness and response planning
Retail and Location Intelligence:
- Analyzing customer foot traffic and behavior in brick-and-mortar stores
- Location-based marketing and personalized recommendations
Agriculture and Precision Farming:
- Remote sensing and geospatial analysis for precision agriculture, including crop yield prediction and disease detection
- Optimizing irrigation and fertilizer usage based on spatial data
Healthcare and Epidemiology:
- Tracking disease outbreaks and analyzing spatial patterns to allocate healthcare resources
- Identifying high-risk areas for public health interventions
Energy and Utilities Management:
- Managing energy grids efficiently by analyzing geospatial data on power distribution and consumption
- Identifying optimal locations for renewable energy installations
Real Estate and Property Assessment:
- Automated property valuation using geospatial data on property characteristics and neighborhood factors
- Land parcel analysis for zoning and development decisions
Telecommunications and Network Planning:
- Optimizing network coverage and capacity by analyzing terrain and population density
- Predictive maintenance of network infrastructure
Financial Services:
- Fraud detection and risk assessment based on location data
- Geospatial analysis for assessing investment opportunities in real estate and infrastructure projects
Smart Cities and IoT Integration:
- Integrating IoT sensor data with geospatial analytics to create smarter and more efficient cities
- Traffic management, waste management, and smart grid deployment
Mineral Exploration and Mining:
- Identifying potential mining sites through geospatial analysis of geological data
- Monitoring and optimizing mining operations for resource extraction
Defense and Security:
- Geospatial intelligence (GEOINT) for national security and military applications
- Border surveillance, threat detection, and mission planning
The Data Science Process
The “unified” aspect of the platform is truly critical for geospatial data science through nearly every step. (You’ll still have to work with clients vigorously to gain that business understanding before you get started! Good luck!)
The Data
Databricks provides a powerful platform for exploring and understanding geospatial data. You can leverage its interactive notebooks to visualize geospatial datasets, perform statistical analyses, and gain insights into the structure and characteristics of your data. Paired with GeoAnalytics Engine's SQL functions for tasks like geospatial format conversion, understanding your data, exploring it, and building a consistent, fast data-preparation ETL all become simpler.
# Convert GeoJSON strings into geometry columns with GAE's SQL functions
from geoanalytics.sql import functions as ST

point_geojson = '{"type": "Point","coordinates": [-7472618.18,5189924.02]}'
line_geojson = '{"type": "LineString","coordinates": [[-7489594.84,5178779.67],[-7474281.07,5176558.51],[-7465977.43,5179778.83]]}'
df = spark.createDataFrame([(point_geojson,), (line_geojson,)], ["geojson"])

# Parse each GeoJSON string into a geometry in spatial reference 8857 (Equal Earth)
df.select(ST.geom_from_geojson("geojson", sr=8857).alias("geom_from_geojson")).show(truncate=False)
The Data Product
Whether you are developing a machine learning model with Spark MLlib or a reusable product or script, you can use Databricks to create it, host it, and provide access in various ways, such as through REST APIs. After creating a workflow, you can run evaluation metrics in-notebook, validating simple processes ad hoc and evaluating models through more ML-specific means, such as MLflow. In GIS and with ArcGIS, this will likely look more like continuous or periodic updates to hosted feature layers through GAE's geospatial read-write functionality.
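As a minimal illustration of the MLflow side, the sketch below logs a parameter and an evaluation metric for a run so it can be compared across experiments in the Databricks MLflow UI; the run name, parameter, and metric value are placeholders, not real results.

import mlflow

# Record one hypothetical training run; the parameter and metric are illustrative
with mlflow.start_run(run_name="geospatial-model-sketch"):
    mlflow.log_param("bin_size_km", 200)
    mlflow.log_metric("rmse", 0.42)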
The Data Visualization
Esri's GeoAnalytics Engine provides st.plot as an extension of matplotlib to allow in-notebook visualization inside of Databricks. My personal favorite visualization and customization tool is ArcGIS Pro, however. The ability to plot things quickly in-notebook is extremely valuable during product creation, while the granularity and feature-layer control provided by ArcGIS Pro's symbology customization lets you deliver jaw-dropping maps that are sure to send clients and investors to the moon in awe. A sample visualization snippet follows:
import geoanalytics
# Get states polygons
url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest/services/USA_States_Generalized/FeatureServer/0"
states = spark.read.format("feature-service").load(url).where("STATE_ABBR NOT IN ('AK','HI')")
# Plot states
states.st.plot(cmap_values="POPULATION", figsize=(15, 8), legend=True, legend_kwds={"label": "Population"})
Performance and Scalability
At GEO Jobe, we've been able to apply this process to several geospatial use cases at large scale, nothing shy of true "big data." We leveraged this powerhouse tech stack to turn processes that were nothing short of pipe dreams into reality, optimizing and scaling weeks and days of processing down to hours and minutes.
A Short List of Optimizations
Cluster Configuration and Scaling: As the Databricks documentation puts it, "Databricks provides a number of options when you create and configure clusters to help you get the best performance at the lowest cost."
Data Partitioning: Selecting the right partitioning is critical. Typically, you want 2-4 partitions for each CPU in your cluster. Partitioning on a specific, well-chosen column ensures that only the necessary data is processed, reducing overhead and improving query performance, as in the sketch below.
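A minimal sketch of that rule of thumb, assuming df is an existing Spark DataFrame and region_id is a hypothetical, frequently filtered column:

# Aim for roughly 2-4 partitions per CPU core, keyed on a well-chosen column
cores = spark.sparkContext.defaultParallelism
df = df.repartition(cores * 3, "region_id")  # ~3 partitions per core
print(f"{df.rdd.getNumPartitions()} partitions across {cores} cores")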
Predicate Pushdown: Filter and reduce the data as early as possible in the query execution plan. Spark supports predicate pushdown for Parquet files, so filters can be evaluated at the file level before data is loaded.
Optimized File Formats: Parquet has repeatedly proven to be a superior format where applicable, though more integrations are still needed to smooth the process. The table below shows why it matters, and a combined sketch follows it.
| Dataset | Size on Amazon S3 | Query Run Time | Data Scanned | Cost |
| --- | --- | --- | --- | --- |
| Data stored as CSV files | 1 TB | 236 seconds | 1.15 TB | $5.75 |
| Data stored in Apache Parquet format | 130 GB | 6.78 seconds | 2.51 GB | $0.01 |
| Savings | 87% less when using Parquet | 34x faster | 99% less data scanned | 99.7% savings |
The importance of Optimized File Formats (Source: Databricks)
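To tie the last two optimizations together, here is a hedged sketch: write once to Parquet partitioned by a column, then read back with an early filter that Spark can satisfy through partition pruning and predicate pushdown. The df, path, and column names are hypothetical.

# Write partitioned Parquet (hypothetical path and partition column)
df.write.mode("overwrite").partitionBy("state").parquet("/mnt/data/obs_parquet")

# Read with early filters: "state" (partition column) drives partition pruning,
# while "reading" (data column) is pushed down into the Parquet scan
filtered = (spark.read.parquet("/mnt/data/obs_parquet")
            .where("state = 'TN' AND reading > 100"))
filtered.explain()  # look for PartitionFilters / PushedFilters in the plan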
Future Trends
GeoAnalytics Engine, as a relatively new tool, already gives us much needed functionality, with more added as recently as version 1.2. Two tools we hope to see linked in the near future are ArcGIS GeoAnalytics Engine (powered by Databricks and Spark) and ArcGIS Knowledge, with its ability to generate knowledge graphs and graph-database insights from entity-relationship data.
Conclusion
Databricks as a unified platform, Spark as its engine, and GeoAnalytics Engine in conjunction prove to be a highly scalable, highly performant geospatial trinity. GAE is an emerging technology, early in its development, with much promise to come. As more industries utilize these technologies for GIS, the challenges and shortcomings will continue to dissipate, further integrations will arise, and more resources will emerge.
We encourage GIS users to continue monitoring these tools, and if you're not already utilizing GIS data science, reach out to the GEO Jobe Data Science team via email at connect@geo-jobe.com!
Bonus Resources
- Get Started with GAE
- Install GeoAnalytics Engine
- Data Science Process
- Seaborn Library
- GAE in Databricks