Announcing the Release of PowerTools for Open XML V1.1
Today, I’m pleased to announce the release of PowerTools for Open XML V1.1. PowerTools for Open XML is an open source project on CodePlex that makes it easy to create and modify Open XML documents using PowerShell scripts. I introduced the PowerTools for Open XML in June 2008 in the post, Automated Processing of Open XML Documents using PowerShell. That post contains a screen cast that demonstrates the functionality in the initial release. You can find a list of all cmdlets in PowerTools for Open XML here.
Note: In this post and screen cast, I’m going to focus on the new functionality in this new release of PowerTools for Open XML. See the above links for more information on the other cmdlets.
It’s also important to note that this is not a supported Microsoft product and doesn’t necessarily represent future product direction. We think it will serve as inspiration for customers who need to create and modify Open XML documents programmatically.
This new release (1.1) is an important one. It provides guidance and example code for dealing with one of the more complicated issues associated with Open XML, which is interrelated markup in word processing documents. The following screen cast demonstrates some of the new functionality in version 1.1. The screen cast contains the same information that is presented in this post – take your pick about how you like to consume your information. It’s interesting to see the cmdlets in action – they are fast!
Video: PowerTools for Open XML 1.1
Paragraphs in word processing documents can contain markup that is related to markup elsewhere in the document – either to markup in another paragraph, or to content in other parts of the Open XML package. The post, Inserting / Deleting / Moving Paragraphs in Open XML Wordprocessing Documents, introduces this issue in detail. It describes a number of ways that markup is interrelated. The post Move/Insert/Delete Paragraphs in Word Processing Documents using the Open XML SDK introduces the C# code that is the basis for the most important new functionality in PowerTools V1.1.
How to Download and Install the PowerTools for Open XML
You can download the source code for PowerTools for Open XML on CodePlex. Click on the Releases tab to find a zip file that contains the source code.
If you just want to use the PowerTools for Open XML without compiling the source code, you can download binaries at two different places:
StaffDotNet, a consulting company, has posted the binaries here.
Julien Chable has posted the binaries here.
Interrelated Markup
First, I’ll describe exactly what I mean by ‘Interrelated Markup’.
The following screen clipping shows a small document that has a comment that spans paragraphs:
The markup for this document looks like this:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p> <!-- First Paragraph -->
      <w:r>
        <w:t>This is the fir</w:t>
      </w:r>
      <w:commentRangeStart w:id="0"/>
      <w:r>
        <w:t>st paragraph.</w:t>
      </w:r>
    </w:p>
    <w:p> <!-- Second Paragraph -->
      <w:r>
        <w:t>This i</w:t>
      </w:r>
      <w:commentRangeEnd w:id="0"/>
      <w:r>
        <w:rPr>
          <w:rStyle w:val="CommentReference"/>
        </w:rPr>
        <w:commentReference w:id="0"/>
      </w:r>
      <w:r>
        <w:t>s the second paragraph.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>
As you can see, some of the markup associated with the comment is in the first paragraph, and some of the markup is in the second paragraph. Further, the <w:commentReference> element refers to markup in the Comments part of the package. If we were to simply move the first <w:p> element to another document, or another location in the same document, we would create an invalid document.
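To make the problem concrete, here is a minimal Python sketch (not part of PowerTools, and not how the cmdlets are implemented) that scans markup like the sample above and reports comments whose range starts in one paragraph and ends in another. The function name spanning_comments and the trimmed SAMPLE string are made up for this illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The sample markup from above, trimmed to the elements that matter here.
SAMPLE = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t>This is the fir</w:t></w:r>
      <w:commentRangeStart w:id="0"/>
      <w:r><w:t>st paragraph.</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>This i</w:t></w:r>
      <w:commentRangeEnd w:id="0"/>
      <w:r><w:commentReference w:id="0"/></w:r>
      <w:r><w:t>s the second paragraph.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
"""


def spanning_comments(document_xml):
    """Map comment id -> (start paragraph, end paragraph), 1-based,
    for comments whose range starts and ends in different paragraphs."""
    body = ET.fromstring(document_xml.strip()).find(f"{{{W}}}body")
    starts, ends = {}, {}
    for number, para in enumerate(body.findall(f"{{{W}}}p"), start=1):
        for el in para.iter():
            cid = el.get(f"{{{W}}}id")
            if el.tag == f"{{{W}}}commentRangeStart":
                starts[cid] = number
            elif el.tag == f"{{{W}}}commentRangeEnd":
                ends[cid] = number
    # Keep only the comments that cross a paragraph boundary.
    return {cid: (starts[cid], ends[cid])
            for cid in starts
            if cid in ends and starts[cid] != ends[cid]}


print(spanning_comments(SAMPLE))
```

Any comment this reports is one that would be left half-open if only one of its paragraphs were copied elsewhere.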
Merge-OpenXmlDocument Cmdlet
To deal with this issue, we’ve created a new cmdlet named Merge-OpenXmlDocument. This cmdlet takes as input multiple source documents, along with a range of paragraphs for each source document, and constructs a new, valid Open XML document:
We’ve identified 15 issues where paragraphs have interrelated markup. There are basically three types of issues:
If a paragraph or a run refers to a style or font, the style or font needs to be in the newly constructed document.
If a paragraph has markup that is related to markup in another paragraph, that markup is fixed so that the newly constructed document contains all necessary markup. The issue with comment markup is an example of this. Bookmarks are another one.
If a paragraph contains markup that is related to another part, then that markup is fixed, and the related part is fixed, so that the newly constructed document is valid. For example, the text of the comment needs to be put into the comments part of the new document. If a paragraph contains an image, then the image part is inserted in the new document.
We can run the Merge-OpenXmlDocument cmdlet on the document that has a comment that spans paragraphs, specifying that the new document contains only the first paragraph, like this:
Merge-OpenXmlDocument -OutputPath Test01New.docx `
-Path Test01.docx -Start 1 -Count 1
This line of PowerShell script takes Test01.docx as input, and creates Test01New.docx. The Start and Count parameters specify that the new document will contain just the first paragraph of the source document. When we run it, the new document looks like this:
The markup of the new document looks like this:
<?xml version="1.0" encoding="utf-8"?>
<w:document
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p> <!-- There is only one paragraph -->
      <w:r>
        <w:t>This is the fir</w:t>
      </w:r>
      <w:commentRangeStart w:id="0" />
      <w:r>
        <w:t>st paragraph.</w:t>
      </w:r>
      <w:commentRangeEnd w:id="0" />
      <w:commentReference w:id="0" />
    </w:p>
  </w:body>
</w:document>
As you can see, the markup to delineate the comment is now contained in the first paragraph of the new document.
Merging Content from More than One Document
You can specify more than one source document, as well as a range for each, and compose a new document from them. Here is document Test02a.docx, with a comment in the first paragraph, and tracked revisions in the third paragraph:
Here is document Test02b.docx. It has a few paragraphs styled as Heading 2, with a comment on the second paragraph:
Let’s merge those two documents, taking the first three paragraphs of Test02a and the first two paragraphs of Test02b. The following script does this. In it, we create two instances of the OpenXml.PowerTools.DocumentSource class, passing the path of each source document to the constructor along with the starting paragraph and paragraph count. Then we invoke the Merge-OpenXmlDocument cmdlet, specifying the output path and the two source objects that we created. From this, you can see that you can programmatically set up your sources in just about any way that you like.
$doc1 = New-Object `
-TypeName OpenXml.PowerTools.DocumentSource `
-ArgumentList C:\PowerToolsOpenXml\Test02a.docx, 1, 3
$doc2 = New-Object `
-TypeName OpenXml.PowerTools.DocumentSource `
-ArgumentList C:\PowerToolsOpenXml\Test02b.docx, 1, 2
Merge-OpenXmlDocument -OutputPath Test02New.docx -Sources $doc1,$doc2
Here is the resulting document.
You can see that the comments from both source documents are moved to the new document. The tracked revisions are moved also.
You can also construct an array of OpenXml.PowerTools.DocumentSource objects, and pass the array to Merge-OpenXmlDocument.
Using Merge-OpenXmlDocument to Remove Paragraphs
You can also use Merge-OpenXmlDocument to remove paragraphs from the middle of a document. You can do this by specifying as sources the same document twice, first specifying the paragraphs before the range you want to delete, and then specifying the paragraphs after the range to delete.
The following document contains three paragraphs, with an image in each paragraph:
We can run the following script to create a new document that contains just the first and third paragraphs:
$doc1 = New-Object `
-TypeName OpenXml.PowerTools.DocumentSource `
-ArgumentList C:\PowerToolsOpenXml\Test03.docx, 1, 1
$doc2 = New-Object `
-TypeName OpenXml.PowerTools.DocumentSource `
-ArgumentList C:\PowerToolsOpenXml\Test03.docx, 3, 1
Merge-OpenXmlDocument -OutputPath Test03New.docx -Sources $doc1,$doc2
Running this script produces the following document:
You can see that the new document has only the first and third paragraphs, and that the appropriate images have been moved to the new document. If you were to open and examine the parts in the new document package, you would see that only those two images were moved to it.
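The arithmetic behind "keep everything before and after the deleted range" can be sketched in a few lines of Python. This is only a model of the numbers passed to the two DocumentSource objects; the helper name ranges_to_keep is hypothetical and nothing here calls PowerTools.

```python
def ranges_to_keep(total, delete_start, delete_count):
    """Return 1-based (start, count) ranges that cover every element
    of a document with `total` elements except the deleted ones."""
    if delete_start < 1 or delete_count < 0 or delete_start + delete_count - 1 > total:
        raise ValueError("deleted range must lie inside the document")
    ranges = []
    before = delete_start - 1                 # elements ahead of the deleted range
    if before > 0:
        ranges.append((1, before))
    after_start = delete_start + delete_count
    after = total - after_start + 1           # elements behind the deleted range
    if after > 0:
        ranges.append((after_start, after))
    return ranges


# Three paragraphs, remove the second one.
print(ranges_to_keep(total=3, delete_start=2, delete_count=1))
```

For the three-paragraph example this yields the pairs (1, 1) and (3, 1), the same start and count values used in the script above.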
Moving Styles and Fonts to the Merged Document
One important aspect of the functionality of Merge-OpenXmlDocument is how it handles resources such as styles and fonts. It always takes the first definition of a style in the list of source documents. To demonstrate this, let’s look at a couple of documents.
The following document contains two paragraphs. The first paragraph, styled Code1, is in the Courier New 10 point font. The second paragraph, styled Code2, is in the Courier New 16 point font:
The following document also contains two paragraphs. The first paragraph, styled Code2, is in the Lucida Console 10 point font. The second paragraph, styled Code3, is in the Lucida Console 16 point font:
The following is a simple use of Merge-OpenXmlDocument to concatenate these two documents into a single document:
Merge-OpenXmlDocument `
-OutputPath Test04New.docx `
-Path Test04a.docx,Test04b.docx
The resulting document is below. This document inherited the Code1 and Code2 styles from Test04a.docx, and inherited the Code3 style from Test04b.docx.
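The "first definition wins" rule can be modeled with a plain dictionary. The Python sketch below is an illustration of the rule as described here, not the cmdlet's actual code; merge_styles and the string descriptions of each style are invented for the example.

```python
def merge_styles(sources):
    """Merge style tables in source order, keeping the first definition
    seen for each style name. Returns name -> (origin document, definition)."""
    merged = {}
    for doc_name, styles in sources:
        for style_name, definition in styles.items():
            if style_name not in merged:      # later definitions are ignored
                merged[style_name] = (doc_name, definition)
    return merged


sources = [
    ("Test04a.docx", {"Code1": "Courier New 10pt", "Code2": "Courier New 16pt"}),
    ("Test04b.docx", {"Code2": "Lucida Console 10pt", "Code3": "Lucida Console 16pt"}),
]

for name, (origin, definition) in merge_styles(sources).items():
    print(name, "<-", origin, definition)
```

Code2 is defined in both documents, so under this rule a paragraph from Test04b.docx styled Code2 takes on the Test04a.docx definition in the merged document.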
Select-OpenXmlString Cmdlet
But assembling documents in this fashion is only half the story. We also need to be able to programmatically find the paragraphs we’re interested in. In this new version of the PowerTools, there is a new cmdlet named Select-OpenXmlString, which has similar functionality to the Select-String cmdlet that comes with PowerShell. We need to use paragraph numbers when merging documents. Select-OpenXmlString can find those paragraph numbers.
The following document contains six paragraphs: three styled as Heading1, and three styled as Normal.
Our goal is to find a specific heading paragraph, and then create a new document that contains the paragraph following the heading paragraph.
The following script shows how to use Select-OpenXmlString to find a paragraph styled as Heading1 that contains the text “Para2”:
Select-OpenXmlString `
-Path Test05.docx `
-Style Heading1 `
-simpleMatch Para2
This produces the following output:
PS C:\PowerToolsOpenXml> Select-OpenXmlString `
>> -Path Test05.docx `
>> -Style Heading1 `
>> -simpleMatch Para2
>>
Path : C:\PowerToolsOpenXml\Test05.docx
Filename : Test05.docx
ElementNumber : 3
Content : Para2
Style : Heading1
Pattern : Para2
IgnoreCase : True
Select-OpenXmlString produces a collection of paragraph objects that each contain:
The path and name of the document being searched.
The ElementNumber of the found paragraph or paragraphs.
The Style of the found paragraphs.
The Content of the found paragraphs.
If we want to compose a document that contains the single paragraph following the selected paragraph, we can write a script as follows. This script first finds the paragraph that we’re interested in, and assigns the element number to the variable $a. It then adds one to $a, and uses the result when invoking Merge-OpenXmlDocument:
$a = (Select-OpenXmlString `
-Path Test05.docx `
-Style Heading1 `
-simpleMatch Para2).ElementNumber
$b = $a + 1
Merge-OpenXmlDocument `
-OutputPath Test05New.docx `
-Path Test05.docx -Start $b -Count 1
The new document contains just the paragraph following the heading paragraph that we found.
Splitting a Document into Multiple Documents
We can use that same source file, and do something pretty cool – we can split the document into multiple documents. Each ‘Heading1’ paragraph starts a new document.
$source = "Test05.docx"
Select-OpenXmlString -Path $source -Style "Heading 1" |
    ForEach-Object `
        -begin `
        {
            $last = 0;
            $num = 1;
        } `
        -process `
        {
            if ($last -eq 0) {$last = 1}
            else
            {
                Merge-OpenXmlDocument `
                    -Path $source `
                    -Start $last -Count ($_.ElementNumber - $last) `
                    -OutputPath ("Split"+$num+".docx");
                $last = $_.ElementNumber;
                $num = $num + 1;
            }
        } `
        -end `
        {
            Merge-OpenXmlDocument `
                -Path $source `
                -Start $last `
                -OutputPath ("Split"+$num+".docx")
        }
When you run this script on Test05.docx, it produces three new documents, Split1.docx, Split2.docx, and Split3.docx.
If you want to split a document on paragraphs styled either “Heading 1” or “Heading 2”, you can alter the Select-OpenXmlString call in the above script like this:
$source = "AnotherDocument.docx"
Select-OpenXmlString -Path $source -Style "Heading 1","Heading 2" |
...
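The bookkeeping in the splitting script is easy to get wrong by one, so here is a small Python model of just that part. It takes the element numbers of the heading paragraphs and returns the Start and Count values that each Merge-OpenXmlDocument call would receive. The function name split_ranges is hypothetical, and the fallback for a document with no headings is a choice made for this sketch rather than something the script above handles.

```python
def split_ranges(heading_numbers):
    """Given ascending 1-based element numbers of the heading paragraphs,
    return one (start, count) pair per output document.
    count is None for the last document, meaning "through the end"."""
    ranges = []
    last = 0
    for number in heading_numbers:
        if last == 0:
            last = 1                      # first heading: only remember the start
        else:
            ranges.append((last, number - last))
            last = number
    # Final document runs from the last heading to the end of the source.
    # With no headings at all, fall back to the whole document (sketch-only choice).
    ranges.append((max(last, 1), None))
    return ranges


# Headings at elements 1, 3 and 5, as in the six-paragraph example.
print(split_ranges([1, 3, 5]))
```

As in the script, the first output document always starts at element 1, so any content that precedes the first heading ends up in the first split document.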
Extracting Text of a Document
We created the Select-OpenXmlString cmdlet primarily to find paragraph numbers to pass to the Merge-OpenXmlDocument cmdlet; however, it has interesting applications of its own. You can use it to extract the text of an Open XML document.
Select-OpenXmlString MyDocument.docx |
ForEach-Object -Process { $_.Content } >MyDocument.txt
And if we want to retrieve the text only for paragraphs of a specific style, we can do so like this:
Select-OpenXmlString -Style Code MyDocument.docx | `
ForEach-Object -Process { $_.Content } >MyDocument.txt
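For readers curious what "the text of a paragraph" means at the markup level, the following Python sketch joins the w:t runs of each body-level paragraph. It is an assumption that the Content property is computed along these lines; the function paragraph_texts and the inline SAMPLE are invented for illustration and do not use PowerTools.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Two paragraphs whose text is split across several runs.
SAMPLE = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t>This is the fir</w:t></w:r>
      <w:r><w:t>st paragraph.</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>This i</w:t></w:r>
      <w:r><w:t>s the second paragraph.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
"""


def paragraph_texts(document_xml):
    """Return the text of each body-level paragraph, joining its w:t runs."""
    body = ET.fromstring(document_xml.strip()).find(f"{{{W}}}body")
    return ["".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
            for p in body.findall(f"{{{W}}}p")]


for line in paragraph_texts(SAMPLE):
    print(line)
```

The point of the sketch is that a single visible sentence may be stored as several runs, so text extraction has to concatenate them per paragraph.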
Using Select-OpenXmlString to Find Documents
You can use the -List parameter of Select-OpenXmlString to retrieve a list of all documents with specific content. In this directory, I have a lot of Open XML documents, and want to find all documents that mention France.
Select-OpenXmlString *.docx -simpleMatch France -List | Select-Object Filename
Select-OpenXmlString also allows specification of a regular expression.
Select-OpenXmlString *.docx -Pattern "Customer ID: L.*" -List
Details about the Merge-OpenXmlDocument Start and Count Parameters
To simplify the explanation of how Merge-OpenXmlDocument works, up to this point in the post I have described the Start and Count parameters in terms of ‘paragraph numbers’. Actually, those parameters don’t refer to paragraph numbers; they refer to child elements of the <w:body> element. The vast majority of child elements of the <w:body> element are paragraph elements (<w:p>). However, content controls and tables can also be children of the <w:body> element. If the specified range of child elements includes content controls or tables, they are moved in their entirety to the newly constructed document.
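The difference between paragraph numbers and body child numbers shows up as soon as a table or content control sits between two paragraphs. The Python sketch below numbers the direct children of a small, made-up w:body; body_children and SAMPLE are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A body with a paragraph, a table, a content control, and another paragraph.
SAMPLE = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>Intro</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc><w:p><w:r><w:t>Cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
    <w:sdt><w:sdtContent><w:p><w:r><w:t>Control</w:t></w:r></w:p></w:sdtContent></w:sdt>
    <w:p><w:r><w:t>Outro</w:t></w:r></w:p>
  </w:body>
</w:document>
"""


def body_children(document_xml):
    """Return (1-based position, local element name) for each direct
    child of w:body. Nested paragraphs are not counted separately."""
    body = ET.fromstring(document_xml.strip()).find(f"{{{W}}}body")
    return [(number, child.tag.split("}")[1])
            for number, child in enumerate(body, start=1)]


print(body_children(SAMPLE))
```

Here the "Outro" paragraph is the second top-level paragraph but the fourth child of the body, so it is the fourth position that a Start value would need to name. The paragraphs inside the table and the content control do not get numbers of their own.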
Participate in PowerTools for Open XML
If you are a C# developer interested in either Open XML or in PowerShell, I invite you to participate in the PowerTools for Open XML project. Please contact me via the “EMAIL” link at the top of my blog.
Posted: Thursday, March 19, 2009 1:04 PM by EricWhite