Tuesday, January 13, 2009

Loading GeoNames Data Into SQL Server 2008 (yet another way...)

 

Loading GeoNames Data Into SQL Server 2008 (yet another way...)

In his post last year, Integrating Virtual Earth and GeoNames, Johannes described how to load the GeoNames database into SQL Server using the "Import Data" Wizard in SQL Server Management Studio. In his scenario, Johannes did not need to take advantage of the spatial type support in SQL Server 2008, since there were columns of latitude and longitude available for use by Virtual Earth.  But what if you wanted to load the GeoNames locations into SQL Server Spatial and use the new geospatial features?  This post describes how to do that in 6 easy steps...

  1. Download GeoNames data
  2. Prepare GeoNames data
  3. Create a database table for the GeoNames data
  4. Load the GeoNames data
  5. Add, update and index the spatial data column
  6. Sample spatial query

So, let's get started...

Download GeoNames Data 

You can find GeoNames data organized by country (and a few other organizational categories) by browsing the GeoNames Download page. If you wanted to download the GeoNames data for the United States, you would choose US.zip.  For this exercise, we are going to download the allCountries.zip file, containing the full worldwide GeoNames database.  Unzipping the downloaded file will produce the file allCountries.txt.

Note: This version of GeoNames data was download on January 13, 2009 and contained 6,906,334 records.

Prepare GeoNames Data

The GeoNames data is available as a tab-delimited, UTF-8 encoded text file. Each record is terminated with newline character ('\n').  SQL Server 2008 does not support UTF-8, expecting such data in UTF-16 encoding with each record delimited with a carriage return, newline combination ('\r\n').  Additionally, it needs a Byte Order Marker (BOM) at the beginning of the file.  Johannes introduced NAnt as a way to convert UTF-8 data to UTF-16.  I chose another way.  Using the Windows text editor, EditPad Pro, I converted the allCountries.txt file into a form compatible with SQL Server 2008. Here is the workflow I used with EditPad Pro:

Note: While this conversion could conceivably be done in Word 2007, Word cannot open and operate efficiently on files of this size.  EditPad Pro can reasonably handle data of this magnitude.

a. Open allCountries.txt file in EditPad Pro.

b. Set Record Delimiter. In the "Convert" menu (main menu), select "To Windows (CR LF)".  This will set the record delimiter to '\r\n'.

image

This took several minutes to complete in EditPad Pro after the "To Windows (CR LF)" option was selected - presumably loading the file into memory and performing the requested operation...

c. Set text encoding. While still in the "Convert" menu, choose "Text Encoding". 

image

In the Text Encoding menu note the Original Encoding, set to Unicode, UTF-8.  Select the "Encode the original data with another character set." button.  Under New Encoding, choose "Unicode, UTF-16 little endian".

image

After hitting "OK", my humble dual proc machine with 2GB of RAM squawked about low virtual memory, but continued to process the file...

d. Set the Byte Order Marker (BOM).  In the "Options" menu (main menu), select "Configure File Types...".

image

Choose the "Encoding" tab.  Under File Types, make sure "Text Document" is selected. Make sure that the Text Encoding is set as follows:

image

Note on the Default Line Break Style: While it appears that the line break was already set under stop b. (above), it appears that setting the line break in both places is required, though it is not obvious, why...

Note on the Byte Order Marker (BOM): You can see the BOM as the FF FE bytes at the image start of the file by pressing Crtl+H in EditPad Pro (this switches EditPad Pro to hex display mode). If your file does not include the BOM, SQL Server will complain, when loading data, that the "...file does not have a Unicode signature."

e.  Create new file copy, correctly encoded for SQL Server 2008. In the "File" menu (main menu), select "Save Copy As...".

image

Save the file with the desired name (I used allCountries_utf16.txt):

image

The file, allCountries_utf16.txt, is now a Unicode UTF-16, BOM encoded text file with tab-delimited ('\t') fields and CR\LF row terminators ('\r\n'), ready for SQL Server 2008.

Before we leave this section, here are some statistics:

FILE
             SIZE

allCountries.zip
   174,594 KB

allCountries.txt
   783,093 KB

allCountries_utf16.txt
1,563,976 KB

Create a database table for the GeoNames data

Here is the T-SQL to create a table called GeoNames...

CREATE TABLE GeoNames(
      geonameid int NOT NULL,
      name nvarchar(200) NULL,
      asciiname nvarchar(200) NULL,
      alternatenames nvarchar(max) NULL,
      latitude float NULL,
      longitude float NULL,
      feature_class char(2) NULL,
      feature_code nvarchar(10) NULL,
      country_code char(3) NULL,
      cc2 char(60) NULL,
      admin1_code nvarchar(20) NULL,
      admin2_code nvarchar(80) NULL,
      admin3_code nvarchar(20) NULL,
      admin4_code nvarchar(20) NULL,
      population int NULL,
      elevation int NULL,
      gtopo30 int NULL,
      timezone char(31) NULL,
      modification_date date NULL
)
GO

Note on the alternatenames column: This column holds the GeoNames field which requires Unicode (UTF-8, UTF-16).

Load the GeoNames data

I used the BULK INSERT command to load the allCountries_utf16.txt file.  Note the DATAFILETYPE = 'widechar' parameter - this is required for loading UTF-16 data.
BULK
  INSERT GeoNames
      FROM 'C:\temp\allCountries_utf16.txt'
            WITH(
                  DATAFILETYPE = 'widechar',
                  FIELDTERMINATOR = '\t',
                  ROWTERMINATOR = '\n'
                )
GO
--(6906334 rows(s) affected) (00:05:44)

Note on the ROWTERMINATOR: While I have changed the Row Terminator in the allCountries_utf16.txt file to '\r\n', SQL Server, curiously, requires the Row Terminator to be set to '\n'.  There is no good explanation for this...

Add, update and index the spatial data column

The current table does not contain a column to hold native spatial data.  Since the data is in WGS84 geographic coordinates, I used the geography data type for the new column:

ALTER TABLE GeoNames
  ADD geog GEOGRAPHY NULL
GO

To populate the new geography column (geog), I used the STGeomFromText constructor to create a POINT feature for each GeoNames row.  Since this is a text-based constructor, the longitude and latitude data, coming from columns of like name, will need to be converted into text using the CAST function.  The SRID for the STGeomFromText constructor is set to 4326, the EPSG value for WGS84 coordinates.

UPDATE GeoNames
  SET geog = GEOGRAPHY::STGeomFromText
('POINT(' + CAST(longitude AS CHAR(20)) 
+ ' ' + CAST(latitude AS CHAR(20)) + ')',4326)
GO
--(6906334 rows(s) affected) (00:09:40)

Note on coordinate order: Since this is an OGC-based operator, the WKT is formed as: 'POINT (<longitude> <latitude>)'.

In order to create a spatial index, the table must have a primary key.  While I could have created the primary key in the CREATE TABLE DDL, this would have meant that the data was loaded into an existing index, slowing down the load process considerably.

ALTER TABLE GeoNames
  ADD CONSTRAINT pk_geonames_geonameid
  PRIMARY KEY (geonameid )
GO

Here is the DDL to create a spatial index on the geography column, geog:

CREATE SPATIAL INDEX geonames_mmmm16_sidx
   ON GeoNames(geog)
   USING GEOGRAPHY_GRID
   WITH (
     GRIDS = (MEDIUM, MEDIUM, MEDIUM, MEDIUM),
     CELLS_PER_OBJECT = 16,
     PAD_INDEX = ON

   )
GO
--(00:10:30)

Sample spatial query

As an example of the type of queries which now be supported using the GeoNames table, consider the following query, which asks "find all place names within 1 kilometer of the Travis County, Texas boundary":

Note: the distance parameter for the STBuffer() method (geography data type) is in meters...

DECLARE @g GEOGRAPHY
DECLARE @h GEOGRAPHY
DECLARE @i GEOGRAPHY
DECLARE @j GEOGRAPHY
SELECT @g = geog FROM US_Counties WHERE NAME = 'Travis'
SELECT @h = @g.STBuffer(1000)  -- Positive buffer around Travis County
SELECT @i = @g.STBuffer(-1000) -- Negative buffer around Travis County
SELECT @j = @h.STDifference(@i)-- Polygon (buffer "band" around the boundary)
SELECT name FROM GeoNames WHERE geog.STIntersects(@j)=1 
  ORDER BY name ASC
--(91 rows(s) affected)
--SQL Server Execution Times:
--Elapsed time = 823 ms.

Here is the list of place names returned by the query:

Anderson Mill
Anderson Mill Elementary School
Apple Spring
Apple Spring Hollow
Austin Air Ranch Airport
Austin Raceway Park
Balcones Country Club
Balcones Country Club Lake
Balcones Country Club Lake Dam
Bear Creek
Best Western Southgate Inn Ste
Blessed Juan Diego Catholic Church
Bonnet Cemetery
Boultinghouse Mountain
Bratton Cemetery
Bridgeway Community Church
Cambrian Creek
Canyon Vista Middle School
Canyon Vista Pool
Cedar Park High School
Center Union Church
Chalk Knob
Chalk Knob Branch
Child Evangelism Fellowship Church
Church of Christ - Pond Springs
Clarence Washington Farm Dam
Clarence Washington Farm Lake
Cottonwood Branch
Crestwood Suites - Austin
Cypress Creek Baptist Church

Cypress Elementary School
Devils Hollow
Esa Austin-Round Rock-South
Fall Creek Cemetery
Fitzhugh
Fitzhugh Cemetery
Garfield Pumping Station
Gateway Community Church
Gay Hollow
Hamilton Hill
Hammetts Crossing
Hampton Inn Austin Round Rock
Harris Branch
Haynie Flat Cemetery
Hilton Garden Inn Round Rock 
Hope Presbyterian Church
Huddleston Cemetery
Ingram Cemetery
Jollyville
Jollyville Cemetery
Jollyville Elementary School
KGTN-AM (Creedmoor)
Koenig Ranch
La Frontera in Round Rock
La Quinta Inn & Suites Round Rock South
Manchaca Optimist Youth Sports Complex
Manchaca Springs
Marriott Austin North
Martin Hill
McNeil

McNeil High School
Merrelltown
Muleshoe Bend
Muleshoe Bend Trail
New Hope Community Church
Niederwald Cemetery
North Creek Park
Pond Springs
Presbyterian Church of the Hills
Purple Sage Elementary School
Rattan Creek Trail
Red Bluff Creek
Residence Inn By Marriott Austin Round Rock
Rhodes Cemetery
Rim Rock Trail
Round Rock Korean Presbyterian Church
Round Rock Opportunity Center
Roy Creek
Saint Vincent de Paul Catholic Church
Shingle Hills
Signal Hill
Spicewood Elementary School
Spillar Ranch
Studio 6 Austin Northwest
Tanglewood Spring
Texas No Name Number 9 Dam
The Marbridge School
Turkey Bend
Turkey Bend Trail
Vasquez Cemetery
Woods Hill

Here is the visual result of the place names locations, within the county boundary buffer, presented in Management Studio:

image

Here is a more detailed view (the actual point locations are represented by the block dots in the center of each point symbol):

image

And there you have it - sub-second performance against 6.9 million rows with a complex spatial object on a very modest machine...

Spatial Ed : Loading GeoNames Data Into SQL Server 2008 (yet another way...)

No comments:

Blog Archive