Powering Geoanalytics with Databricks and Spark Integration

Unlocking the Future of Geospatial Insights

In a world driven by data, where location is the linchpin connecting disparate information streams, the synergy between Databricks, Apache Spark, and Esri’s Geoanalytics Engine emerges as a formidable force—a triumvirate reshaping the very landscape of geospatial analytics. The era of data-driven decision-making has dawned upon us, and at its epicenter lies the power to harness location intelligence for unparalleled insights.

Industries spanning from urban planning and logistics to healthcare and environmental conservation now stand at the threshold of transformative change, where the convergence of cutting-edge technology promises not only efficiency but visionary innovation. Databricks, Spark, and Geoanalytics Engine unite, setting forth a new frontier in geospatial analysis—a fusion that is nothing short of a geospatial dream team.

The Power of Databricks and Spark

In this article, we'll show how Databricks, powered by Spark, reduces the complexity and conquers the challenges associated with geospatial data.

For those unfamiliar with Databricks, they define themselves perfectly:

“Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale.”

Spark’s distributed data processing capabilities are exemplified by its ability to parallelize data across clusters, allowing for high-speed data processing in a distributed fashion. It achieves this through its Resilient Distributed Dataset (RDD) abstraction and optimized query execution, making it a versatile and efficient choice for handling large-scale data processing tasks. Apache Spark has some geospatial SQL query options built-in, but for our purposes, we are going to expand on this by integrating a new tool from Esri: GeoAnalytics Engine.

Esri’s GeoAnalytics Engine

ArcGIS GeoAnalytics Engine extends these capabilities with easy-to-use APIs that accelerate analysis and integrate your ArcGIS platform more deeply into the Databricks unified platform. This gives users access to the scalability and performance benefits of Databricks while seamlessly integrating to-and-from feature services within Esri’s ArcGIS platform.

It also provides support for traditional file types common in GIS applications, such as CSV and Shapefiles. Additionally, it supports other sources, such as Apache Parquet and the geospatially-enabled GeoParquet format built on it, both of which provide massive benefits in terms of storage, querying, and read-write speeds.

Benefits of Integration

On top of data integration, GeoAnalytics Engine (GAE) provides two other primary modules: the SQL module (geoanalytics.sql.functions) and the Tools module (geoanalytics.tools). Users of other ArcGIS APIs and tools will recognize many of the tools provided, such as SummarizeWithin, a common statistical aggregation method in GIS, now accelerated by the Spark engine on Databricks.

For example, configuring the SummarizeWithin tool in GAE looks like this (additional setters and a call to run the tool would follow in a full workflow):

from geoanalytics.tools import SummarizeWithin

result = SummarizeWithin() \
    .setSummaryBins(bin_size=200, bin_size_unit="Kilometers", bin_type='hexagon') \
    .includeShapeSummary(include=True, units="Kilometers")

Use Cases

Urban Planning and Development:

  • Analyzing geospatial data to optimize urban planning, including land use, transportation networks, and infrastructure development
  • Predicting population growth patterns to guide city expansion and resource allocation

Logistics and Supply Chain Optimization:

  • Real-time tracking and optimization of delivery routes for logistics companies
  • Inventory management and demand forecasting considering spatial factors

Natural Resource Management:

  • Monitoring and managing natural resources, such as forests, water bodies, and agricultural land
  • Predicting wildfires, analyzing deforestation, and ensuring sustainable resource usage

Environmental Monitoring:

  • Analyzing environmental data to monitor air and water quality, climate change, and biodiversity
  • Identifying pollution sources and assessing their impact

Emergency Response and Disaster Management:

  • Real-time mapping and visualization of natural disasters like hurricanes, earthquakes, and floods
  • Predictive modeling for disaster preparedness and response planning

Retail and Location Intelligence:

  • Analyzing customer foot traffic and behavior in brick-and-mortar stores
  • Location-based marketing and personalized recommendations

Agriculture and Precision Farming:

  • Remote sensing and geospatial analysis for precision agriculture, including crop yield prediction and disease detection
  • Optimizing irrigation and fertilizer usage based on spatial data

Healthcare and Epidemiology:

  • Tracking disease outbreaks and analyzing spatial patterns to allocate healthcare resources
  • Identifying high-risk areas for public health interventions

Energy and Utilities Management:

  • Managing energy grids efficiently by analyzing geospatial data on power distribution and consumption
  • Identifying optimal locations for renewable energy installations

Real Estate and Property Assessment:

  • Automated property valuation using geospatial data on property characteristics and neighborhood factors
  • Land parcel analysis for zoning and development decisions

Telecommunications and Network Planning:

  • Optimizing network coverage and capacity by analyzing terrain and population density
  • Predictive maintenance of network infrastructure

Financial Services:

  • Fraud detection and risk assessment based on location data
  • Geospatial analysis for assessing investment opportunities in real estate and infrastructure projects

Smart Cities and IoT Integration:

  • Integrating IoT sensor data with geospatial analytics to create smarter and more efficient cities
  • Traffic management, waste management, and smart grid deployment

Mineral Exploration and Mining:

  • Identifying potential mining sites through geospatial analysis of geological data
  • Monitoring and optimizing mining operations for resource extraction

Defense and Security:

  • Geospatial intelligence (GEOINT) for national security and military applications
  • Border surveillance, threat detection, and mission planning
A fun infographic of use cases from Databricks
(Source: Processing Geospatial Data at Scale With Databricks)

The Data Science Process

The “unified” aspect of the platform is truly critical for geospatial data science through nearly every step. (You’ll still have to work with clients vigorously to gain that business understanding before you get started! Good luck!)

The Data

Databricks provides a powerful platform for exploring and understanding geospatial data. You can leverage its interactive notebooks to visualize geospatial datasets, perform statistical analyses, and gain insights into the structure and characteristics of your data. Paired with GeoAnalytics Engine’s conversion SQL functions, such as geospatial format converters, understanding your data, exploring your data, and creating a consistent, fast data-preparation ETL all become simpler.

from geoanalytics.sql import functions as ST

# Build a DataFrame of GeoJSON strings and convert them to geometry columns
point_geojson = '{"type": "Point","coordinates": [-7472618.18,5189924.02]}'
line_geojson = '{"type": "LineString","coordinates": [[-7489594.84,5178779.67],[-7474281.07,5176558.51],[-7465977.43,5179778.83]]}'
df = spark.createDataFrame([(point_geojson,),(line_geojson,)], ["geojson"])
df.select(ST.geom_from_geojson("geojson", sr=8857).alias("geom_from_geojson")).show(truncate=False)

The Data Product

Whether you are developing a machine learning model with Apache MLlib or a reusable product or script, you can use Databricks to build it, host it, and provide access in various ways, such as through REST APIs. After creating a workflow, you can run evaluation metrics in-notebook, validating simple processes ad hoc and evaluating models through more ML-specific means, such as MLflow. In GIS and with ArcGIS, this will often take the form of continuous or periodic updates to hosted feature layers through GAE’s geospatial read-write functionality.

The Data Visualization

Esri’s GeoAnalytics Engine provides st.plot as an extension of matplotlib to allow in-notebook visualization inside of Databricks. My personal favorite visualization and customization tool, however, is ArcGIS Pro. The ability to plot things quickly in-notebook is extremely valuable during product creation, while the granularity and feature-layer control provided by ArcGIS Pro’s symbology customization delivers jaw-dropping maps that are sure to send clients and investors to the moon in awe. Esri’s documentation provides code snippets for visualization, such as the following example.

import geoanalytics
# Get states polygons
url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest/services/USA_States_Generalized/FeatureServer/0"
states = spark.read.format("feature-service").load(url).where("STATE_ABBR NOT IN ('AK','HI')")
# Plot states
states.st.plot(cmap_values="POPULATION", figsize=(15, 8), legend=True, legend_kwds={"label": "Population"})
A diagram of the data science process (Source: https://www.geeksforgeeks.org/data-science-process/)

Performance and Scalability

At GEO Jobe, we’ve applied this process to several geospatial use cases at large scale. Working with nothing shy of true “big data,” we leveraged this powerhouse tech stack to turn processes that were once pipe dreams into reality, with runtimes of weeks and days optimized and scaled down to hours and minutes.

A Short List of Optimizations

Cluster Configuration and Scaling: “Databricks provides a number of options when you create and configure clusters to help you get the best performance at the lowest cost.”

Data Partitioning: Selecting the right partitioning strategy is critical. Typically, you want 2-4 partitions for each CPU core in your cluster. Partitioning on a specific, well-chosen column ensures that only the necessary data is processed, reducing overhead and improving query performance.

Predicate Pushdown: Filter and reduce the data as early as possible in the query execution plan. Spark supports predicate pushdown for Parquet files.

Optimized File Formats: Parquet has repeatedly proven to be a superior format where applicable, though further integrations are still needed to streamline the process.

| Dataset | Size on Amazon S3 | Query Run Time | Data Scanned | Cost |
|---|---|---|---|---|
| Data stored as CSV files | 1 TB | 236 seconds | 1.15 TB | $5.75 |
| Data stored in Apache Parquet format | 130 GB | 6.78 seconds | 2.51 GB | $0.01 |
| Savings | 87% less when using Parquet | 34x faster | 99% less data scanned | 99.7% savings |

The importance of Optimized File Formats (Source: Databricks)

Future Trends

GeoAnalytics Engine, as a relatively new tool, already gives us much-needed functionality, with more added as recently as version 1.2. Looking ahead, we hope to see ArcGIS GeoAnalytics Engine (powered by Databricks and Spark) linked with ArcGIS Knowledge, with its ability to generate knowledge graphs and graph-database insights from entity-relationship data.


Databricks as a unified platform, Spark as its engine, and GeoAnalytics Engine in conjunction prove to be a highly scalable, highly performant geospatial trinity. GAE is an emerging technology, early in its development stages, with much promise to come. As more industries utilize these technologies for GIS, the challenges and shortcomings will continue to dissipate, further integrations will arise, and more resources will emerge.

We encourage GIS users to keep monitoring these tools, and if you’re not already utilizing GIS data science, reach out to the GEO Jobe Data Science team via email at connect@geo-jobe.com!

Bonus Resources

Explore more from the MapThis! Blog

Data Scientist / GeoAI Specialist