Tuesday, January 20, 2009

Loading GeoNames Data Into SQL Server 2008 (the PowerShell, bcp way...)

In the first installment on this topic, I used a commercial text editor, EditPad Pro, to perform the Unicode file conversion and SQL Server BULK INSERT to load the data. In this installment, I'm going to change the workflow a bit, using Microsoft's PowerShell for the Unicode file conversion and bcp (the Bulk Copy Program) to perform the data loading.  Thus, the new workflow is:

  1. Download GeoNames data
  2. Create a database table for the GeoNames data
  3. Convert and load the GeoNames data using PowerShell and bcp
  4. Add, update and index the spatial data column
  5. Sample spatial query

For this post, I'm going to concentrate on step 3.  For the other workflow steps, please refer back to the first installment on this topic, Loading GeoNames Data Into SQL Server 2008 (yet another way...).

Convert and load the GeoNames data using PowerShell and bcp

For this exercise, my colleague, Bob Beauchemin, gave me some very useful guidance.  If you are not familiar with Bob's blog, it is a great place to find a wealth of information about SQL Server.  Bob has a special affection for spatial data and you can find some very interesting posts on this topic.

Install and Configure PowerShell

The basis for this solution is the Microsoft Windows PowerShell command-line shell. Being an old UNIX guy, my first inclination when presented with the UTF-8 --> UTF-16 conversion problem was shell scripting, and I'm happy to say that PowerShell provides an excellent Windows-based shell for doing exactly what is needed. If you've got Windows Server 2008, it's already there.  If you have Windows XP, Windows Vista, or Windows Server 2003, you will need to download the appropriate version for your platform.

Once installed, you can fire up PowerShell by issuing the command, powershell, in the command shell (Start-> Run -> Open: cmd).  When loaded, you will get a PS> prompt.  Before we can execute a script, we need to set the script execution policy.  For our humble purposes, we don't need to be bothered by the security considerations which might be demanded in a more public execution environment.  If you decide to put your script into wider use, please do set an execution policy suitable for your environment.

Try running the following commands:

PS> Get-ExecutionPolicy
Restricted

PS> Set-ExecutionPolicy Unrestricted

PS> Get-ExecutionPolicy
Unrestricted

Note: The execution policy persists between PowerShell sessions, so you only need to set it once.
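
If you would rather not leave the policy wide open, RemoteSigned is a common, less permissive choice: it lets locally created scripts (such as the profile we build below) run while still requiring downloaded scripts to be signed.

PS> Set-ExecutionPolicy RemoteSigned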

To exit PowerShell, type exit at the PowerShell Prompt. To learn more about PowerShell see the PowerShell Help Center.

Create PowerShell Script

You have now configured PowerShell so that you can easily run the script we are about to describe.

In this script, we will do the following:

  1. Create a temporary file name to hold the output from the command which converts the UTF-8 data to UTF-16.
  2. Execute a command which converts the data from UTF-8 to UTF-16, changes the row terminators to '\r\n' and writes a Unicode signature at the beginning of the converted data (the BOM is described in the previous post).
  3. Load the converted data file into SQL Server 2008 using the Bulk Copy Program, bcp.
  4. Delete the temporary file.

All of these operations will be saved in a function called convert-and-load, which we can run from the PowerShell command line.  Here is the PowerShell script:

function convert-and-load {
  # Build a unique temporary file name for the UTF-16 output
  $tempfile = "$env:temp\TEMP$(Get-Date -format 'yyyyMMddhhmmss').txt"
  # Read the UTF-8 source file and write it back out as UTF-16 (Unicode), with a BOM
  get-content -Path $args[0] -encoding utf8 | out-file $tempfile -encoding Unicode
  # Bulk copy the converted file into the target table on the specified server\instance
  $serverinstance = $args[2]
  $serveroption = "-S$serverinstance"
  bcp $args[1] in $tempfile -w -T $serveroption
  # Clean up the temporary file
  del $tempfile
}

The bcp -w and -T arguments do the following (from MSDN bcp Utility):

-w Performs the bulk copy operation using Unicode characters. This option does not prompt for each field; it uses nchar as the storage type, no prefixes, \t (tab character) as the field separator, and \n (newline character) as the row terminator. Note that even though the documentation says \n (newline) for the row terminator, it really means \r\n (carriage return + line feed).

-T Specifies that bcp connects to SQL Server with a trusted connection.
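
To make this concrete, here is roughly what the function ends up running for the example arguments used later in this post. The temporary file path shown is just illustrative; the actual name is generated from the current date and time:

bcp Sample_GeoNames.dbo.GeoNames in C:\Temp\TEMP20090120093000.txt -w -T -Ssqlspatial\ss1600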

The next step is to install this script so that PowerShell recognizes the function.

Install PowerShell Script

You can install this new function into your PowerShell profile. This profile should be placed in the C:\Documents and Settings\[your Windows login]\My Documents\WindowsPowerShell folder.  The profile file must have a .ps1 extension; I named mine profile.ps1, which is one of the file names PowerShell looks for at startup.  When PowerShell starts up, it reads this file and loads, in this case, our new convert-and-load function.
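
If the profile file does not exist yet, one minimal way to create it from the PowerShell prompt is shown below. The built-in $profile variable points at the host-specific profile (Microsoft.PowerShell_profile.ps1) in the same WindowsPowerShell folder; either that file or profile.ps1 will be read at startup. Paste the convert-and-load function into whichever file you create and save it:

PS> New-Item -Path $profile -ItemType File -Force
PS> notepad $profile

Alternatively, if you only need the function occasionally, save it to any .ps1 file (for example, a hypothetical convert-and-load.ps1) and dot-source it into the current session with: . .\convert-and-load.ps1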

Execute PowerShell Script

The new function, convert-and-load, takes 3 arguments:

  • UTF-8 filename - e.g. c:\temp\allCountries.txt
  • fully qualified database table name - e.g. Sample_GeoNames.dbo.GeoNames
  • database server\database instance name - e.g. sqlspatial\ss1600

Here is how the command looked in my environment, ready for execution:

PS> convert-and-load c:\temp\allCountries.txt Sample_GeoNames.dbo.GeoNames sqlspatial\ss1600

UTF-8 to UTF-16 File Conversion Performance

The entire convert and load process of the allCountries.txt GeoNames dump (6,906,334 records) took approximately 1 hour and 25 minutes to complete, with the UTF-8 to UTF-16 conversion consuming approximately 1 hour and 5 minutes of that time.  This is comparable to the time it took to perform the same conversion in EditPad Pro from my previous post.
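
If you want to reproduce these timings in your own environment, PowerShell's built-in Measure-Command cmdlet reports the elapsed time for whatever you wrap in the script block (note that it suppresses the normal output of the commands it measures):

PS> Measure-Command { convert-and-load c:\temp\allCountries.txt Sample_GeoNames.dbo.GeoNames sqlspatial\ss1600 }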

bcp Performance

I saved the temporary UTF-16 file by removing the del $tempfile line in the script and ran 3 scenarios to get a better feel for the performance of bcp.  Here are the results:

Data residing on c: drive (same drive as the database storage):

6906334 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total: 1112859 Average: (6205.94 rows per second)
18.55 minutes

Data residing on e: drive (database storage on c:):

6906334 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total: 1009954 Average: (6838.27 rows per second)
16.83 minutes

Data residing on e: drive, increased network packet size:

6906334 rows copied.
Network packet size (bytes): 16192
Clock Time (ms.) Total: 1102265 Average: (6265.58 rows per second)
18.37 minutes

There are other tuning measures to improve bcp's performance, and we will investigate these in a future post. Compare this with the 5.75 minutes it took to load the same data using the BULK INSERT T-SQL command (see the "Load the GeoNames data" section from my previous post).
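
If you want to experiment ahead of that post: beyond the packet-size change in the last scenario above (-a), other bcp switches worth trying include -b, which sets the batch size in rows, and -h "TABLOCK", which requests a table-level lock for the load. Treat the values below as illustrative starting points rather than recommendations:

bcp Sample_GeoNames.dbo.GeoNames in C:\Temp\TEMP20090120093000.txt -w -T -Ssqlspatial\ss1600 -a 16384 -b 50000 -h "TABLOCK"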

Conclusion

While we give up some performance, mainly due to bcp, we do have a more convenient workflow.  It is also important to recognize how bcp and BULK INSERT affect their respective execution environments.  bcp places the computing load on the local client machine.  BULK INSERT, on the other hand, places the load directly on the database server.  If you are trying to minimize load on a busy database server, then bcp may be the best solution.

In my next installment on this topic I will explore the SQL PowerShell Extensions, providing access to BULK INSERT from within the PowerShell script.

Technorati Tags: PowerShell,bcp,BULK INSERT,data Loading,GeoNames,SQL SERVER,SQL Server 2008,Spatial,UTF-8,UTF-16,Unicode

Published 20 January 09 11:56 by EdKatibah
