Wednesday, October 17, 2007

Adelaide Geek Dinner Approaches

I announced earlier that I was planning a dinner get-together for developers in Adelaide. I have since finalised the time and venue and have established an initial guest list.

The dinner will be held on Saturday November 17th starting 6:30pm at Cafe Buongiorno in the city. The cafe is at 187 Rundle Street, near the corner of Pulteney Street and the entrance to U-Park (for convenient parking). They serve familiar Italian meals - pizza and pasta starting at around $15.

I have found and contacted several local developer-bloggers and asked them to attend, some have committed, some have tentatively accepted but the response has been positive. If you are, or know, a local developer-blogger who hasn't been personally invited, send me an email and I'll add you to the guest list and ensure I book a large enough table at the cafe.

I look forward to seeing you there.

 Tuesday, October 16, 2007

PowerShell Resources

Ever since I heard of the concept of PowerShell (or Monad as it was known then) I was excited. Now that is been RTM for some time and I have had an opportunity to work with it in a production environment I love it even more.

While PowerShell could be summarised as a cross between a *nix shell and the .NET Framework there is still a lot unique to PowerShell alone and learning how it works and finding efficient tools to work with it is still necessary to make the most of PowerShell.

To learn PowerShell I purchased Bruce Payette's book Windows PowerShell In Action. It is a good-size detailed book at over 500 pages and receiving both the soft-cover book and the searchable PDF was excellent value. Pretty much every aspect of PowerShell is described including why certain design decisions were made. My only issue with the book, and it's not a big issue, is that I was itching to write some PowerShell scripts but script files aren't explained until Chapter 8 and security for script files isn't fully explained until Chapter 13 (the last chapter).

When I started to write my scripts I had the PowerShell console open on one monitor and Notepad open on the other. I would try certain commands in the console window and when they worked and gave the results I wanted I would copy them to the script in Notepad, save, and switch back to the console window to test the script. I still pretty much work like this today but I've replaced Notepad.

Considering I spend most of my time in Visual Studio I really wanted the same Intellisense and Syntax Highlighting experience when writing PowerShell scripts. The first PowerShell "IDE" I encountered was PowerShell Analyzer but I felt overwhelmed by the UI given that all I wanted was to edit .ps1 files. It seems very capable but just didn't feel right. More recently I have tried PowerGUI and it is very close to "Notepad for PowerShell". I have used it to develop the scripts for the last two PowerShell posts and I recommend it.

Do you have any resources you feel have been invaluable for getting the most out of PowerShell?

 Sunday, October 14, 2007

Report Services Automation With PowerShell

In late September Paul Stovell wrote about a set of VB.NET scripts he prepared to help deploy reports to SQL Server Reporting Services. If you've ever had the displeasure of deploying SSRS reports without Visual Studio then you'll understand how much it sucks.

Paul went to the effort to write individual scripts for creating folders and data sources on the server and uploading report definitions and configuring permissions. With Paul's work simple command scripts can then be used deploy reports.

However these command scripts still need to be written and they end up containing much of the same information as can be found in the .rptproj project file and the .rds data source files. I despise the idea of maintaining any sort of configuration information in more than one place so adding to the deploy command script whenever I add a report to the project in Visual Studio just makes me cringe.

Additionally, as Paul briefly mentions, MSBuild (and therefore Team Build) does not support Report Services projects so, once again, to deploy your reports as part of Continuous Integration you need to have separate tools.

Today I constructed a lengthy PowerShell script to take a Report Services .rptproj project file and output a command script that utilises Paul's VB.NET scripts to deploy the reports as per the project settings. Due to the size of the script rather than publishing it inline, you can download it here.

The script accepts three parameters. ProjectFile is the path to .rptproj file for the reports you want to deploy. If you omit this parameter the script uses the first report project file it finds in the current directory. The second parameter, ConfigurationName tells the script which project configuration to use for the target server URL and destination folders. If you omit this parameter the script uses the first configuration defined in the project. The last parameter SearchPaths is a list of paths for the script to search when locating both rs.exe and Paul's .rss files. The SearchPaths parameter is automatically combined with the environment PATH variable and may be omitted.

Here is an example usage:

PS C:\Users\Jason\Dev\MyReports> .\Deploy-SqlReports.ps1 `
    -ProjectName MyReports.rptproj `
    -ConfigurationName Release `
    -SearchPaths "C:\Tools\Report Services\" `
    | Out-File deploy.cmd -Encoding ASCII;

As always, my PowerShell skills are slowly improving and this script is not necessarily perfect in either robustness or efficient use of PowerShell. Hopefully it will be as useful to you as it has been to me and any changes you need should be easily made. Please leave a comment with your thoughts and suggestions.

 Saturday, October 13, 2007

Find Duplicate Files With PowerShell

I have pieced together a simple PowerShell script to recursively locate all duplicate files (by content, not name) below a chosen directory. It is not the most elegant code but for my purposes it works and hopefully you will be able to tweak it to suit your needs.

Firstly, it filters out any zero-length files. Zero-length files are naturally duplicates of each other and can be found quite trivially without my script. Secondly it groups all files by their length because if the length doesn't match, they can't have the same content. The script then excludes the length-groups with only one entry and calculates the MD5 hash of the remaining files. Groups of files with both matching size and hash are then returned in the results.

The hashing function was taken from the Duplicate Files post on the Windows PowerShell team blog. It simply uses the .NET cryptography namespace to compute the hash. From here you could easily exchange the MD5 algorithm for SHA1 or any other preferred algorithm.

Due to the need to read the entire contents of potentially matching files to compute the hash this can cause the script to take a long time against larger files. Executing the script against deep directory structures with many files will take longer too. The script could be easily modified to take a filtered input of files to only find, for example, duplicate photos.

Here is the script:

param ([string] $Path = (Get-Location))

function Get-MD5([System.IO.FileInfo] $file = $(throw 'Usage: Get-MD5 [System.IO.FileInfo]'))
{
    # This Get-MD5 function sourced from:
    # http://blogs.msdn.com/powershell/archive/2006/04/25/583225.aspx
    $stream = $null;
    $cryptoServiceProvider = [System.Security.Cryptography.MD5CryptoServiceProvider];
    $hashAlgorithm = new-object $cryptoServiceProvider
    $stream = $file.OpenRead();
    $hashByteArray = $hashAlgorithm.ComputeHash($stream);
    $stream.Close();

    ## We have to be sure that we close the file stream if any exceptions are thrown.
    trap
    {
        if ($stream -ne $null) { $stream.Close(); }
        break;
    }

    return [string]$hashByteArray;
}

$fileGroups = Get-ChildItem $Path -Recurse `
    | Where-Object { $_.Length -gt 0 } `
    | Group-Object Length `
    | Where-Object { $_.Count -gt 1 };

foreach ($fileGroup in $fileGroups)
{
    foreach ($file in $fileGroup.Group)
    {
        Add-Member NoteProperty ContentHash (Get-MD5 $file) -InputObject $file;
    }

    $fileGroup.Group `
        | Group-Object ContentHash `
        | Where-Object { $_.Count -gt 1 };
}

Once you have the output of the script you could use it delete the unnecessary files:

$dupes = Get-DuplicateItems;
$dupes | % { ($null, $rest) = $_.Group; $rest; } `
| Remove-Item -WhatIf;

As always, if you have any suggestions or improvements don't hesitate to leave a comment here.

BlogML Contribution

I was hit with trackback spam on my blog some time ago and decided the easiest way to stop it was to disable trackbacks on the site all together. The unfortunate downside to this is that I'm not automatically notified when someone blogs in response to one of my posts. I have to go looking.

Today I found a post by Doron Yaacoby written in August (sorry I took so long) about the Live Space support I added to the BlogML project. As I originally mentioned when I posted about this feature, I was too lazy to code a GUI for it. Thankfully, Doron has written one for us, and in WPF too!

It has been a while since I actually used my Live Space BlogML code to convert my own blog but, in response to Doron's note about comment support, from memory the API exposed by the Live Spaces website for querying blog data does not expose comments. Maybe they've changed the API since. Maybe there's a clever way to screen-scrape it.

Thanks for contributing Doron, and thanks for your kind words about my code too.

 Sunday, September 30, 2007

Adelaide Geek Dinner

I am organising a dinner get-together for local developers to meet, socialise, and learn from each other. I am still finalising details of the time and venue (which will depend on the numbers of guests) but I am currently thinking it will be an evening in the middle of November at a restaurant in the Adelaide CBD.

Also, to avoid being overwhelmed by numbers and to ensure the concept will work here before scaling up, I am going to start by only inviting .NET developers who also publish a blog. I already know a few such people in South Australia and have found a few more on Google and will ensure they receive an invitation but I'm sure I've missed some.

So, if you are, or know someone who is, a .NET developer and a blogger and will be in Adelaide in November, send me an email or post a comment here, and I will make sure you receive an invitation when the details have been settled.

If you are a .NET developer but not a blogger and would really like to attend, just grab a free Blogger or Live Spaces account and post one article about your .NET experiences then let me know.

If the evening is a success then we can plan to have these dinners on a regular basis and open the guest list to a larger range of people. Looking forward to meeting you...

 Friday, September 28, 2007

Streaming Large Objects with ADO.NET, Properly

It has been a while since I last worked with storing files in a SQL database and I decided to Google around to remind myself of the best way to do it. I was very disappointed with most of the approaches I found. Unfortunately, my Google-Fu didn't return the MSDN articles I've linked to below, and I had to find out the hard way.

To begin, all solutions I found dealt only with reading a BLOB from a SQL Server image or varbinary(max) column in a streaming fashion. Worst of all very few actually understood what streaming should do, and that is not load the entire object into an array in memory.

My whinging aside, streaming a file out of a SQL table is easy. You start by using a DataReader created by passing CommandBehavior.SequentialAccess to a DbCommand's ExecuteReader function. I also find that selecting only the blob column and only the desired row(s) from the table is the most effective.

When you have the DataReader positioned on the appropriate record you repeatedly call the GetBytes method in a loop, retrieving a small chunk each time and writing it to the output stream. The output can be any IO.Stream like a file or even your ASP.NET response. This MSDN article has a good description of the situation with the SequentialAccess enumeration and some sample code.

Writing a stream of data into a SQL table turned out to be slightly less obvious. I'm only working with SQL Server 2005 so I didn't consider supporting older versions but the approach is similar. SQL 2005 provides a Write "method" on the large value data types in the UPDATE statement.

My solution was to first insert the new row into the table providing values for all columns except the blob. Then I had a stored procedure that would take the row's primary key values, an offset, and a chunk of the data to insert and use the UPDATE .Write method to update the row.

Similar to the reading code, my writing code would read a small chunk from the incoming IO.Stream and pass it to the stored procedure, incrementing the offset each time. Once again, there is another MSDN article that describes the process well but their code looks like it will also work with SQL versions prior to 2005.

In both cases tweaking the size of the chunk used in each iteration of the loop will require some testing and measuring to find the best performance but now you can read and write files of almost 2GB into SQL Server without trying to allocate a similarly sized array in memory first.

 Wednesday, September 26, 2007

Business and Data Layer Architectures

Strongly Typed DataSets in .NET 2.0 do many things very well but they fall short in other places. For a new project I have been researching alternative approaches for structuring the model, logic, and data access projects to allow for easy customisation and good testing.

Firstly, there are some features about DataSets that are very useful and would need to exist in any alternative I consider:

  • Collection filtering and sorting by any properties.
  • Collection and item change notification.
  • IEditableObject support (ie BeginEdit, EndEdit, CancelEdit).
  • Excellent encapsulation of SQL commands.

There is also the limitations of DataSets that should not exist in the alternative I choose:

  • Computed values too complex for DataColumn.Expression cannot be bound to.
  • AllowDBNull and MaxLength violations cause immediate exceptions.
  • Restrictive relationship between model and underlying database structure.
  • Does not utilise Nullable<T> to support columns that allow DBNull.
  • Change notification events cannot be completely silenced.

So far I have briefly investigated CSLA, SubSonic, NHibernate, and some existing applications both real (DotNetKicks) and example (MS Pet Shop), and found that they are all very similar in many ways. Rather than commit myself to a particular framework that may have it's own set of issues I have decided to hand code my own solution to the business and data layering problem.

Others have expressed concerns about going with a third party framework and to a point I agree, but I'm not totally sold either way yet.

I intend to watch my own code evolve as I solve the difficulties associated with the task and refactor my work into common classes. I expect that my code will begin to look much like one of the existing frameworks (although that may be biased by what I've already seen) and perhaps I will end up switching to the solution that most closely matches my own.

The last thing we need is yet another framework.

 Friday, September 21, 2007

Staying Optimistic

The topic of concurrency control in database applications has been on my mind lately and I wanted to get my thoughts in order. Here I intend to present my views on the situation and probably contort some official definitions in the process.

At the outset there are two choices: pessimistic concurrency that will lock data from the time an edit begins until the time the edit is committed or optimistic concurrency that does not lock but only commits data that has not been changed by another process since the edit began.

Pessimistic concurrency has it's place but for user databases it is losing popularity for two reasons. Firstly it requires additional server resources to track which data is locked and by whom, and secondly it means that one user will need to wait before working on the same data that another user is editing.

Optimistic concurrency avoids the first issue by having the client remember a checksum or timestamp of the original data that it can check for at commit time to ensure no other changes have been made. The compromise is that if changes have been made, the user has to either re-input their changes to the new data or use a merging tool (if available) to reconcile the differences.

Obviously, optimistic concurrency is going to work really well in situations where it is rare for two users to edit the same data and in practice with databases full of thousands of records and a only a relatively small group of users working with them, the chances are low. (So low that many database applications don't bother with a merging tool).

However, time is a major factor with optimistic concurrency. The less time between the reading of the original data and the final committing of new data the less likely another user has changed the same data. If a user begins updating a customer's address though, but stops to answer a phone call then go to lunch, when this user returns and remembers to save there is a much greater chance a fellow employee has worked on the customer record.

There are additional factors beyond a user's actions that can increase the chance of an optimistic concurrency violation. It is fair to expect that an application should load at least a subset of records when it first starts to give the user a populated initial view but now the clock is ticking because the application has remembered the state of the data at the time the program opened. Assuming the architecture handles it, this is easily resolved by re-retrieving the latest data from the server immediately before the user begins making changes.

Also, while it may only take 30 seconds for the user to type in a new phone number for an existing customer, the chance of a concurrency violation continues to grow as each minute passes until the application pushes these changes back to the database. In this example having the application save the new data to the server when the user closes the customer dialog is an obvious solution. But time between read and commit is not the only concern...

When there is a concurrency violation, some action needs to be taken. An easy but often inappropriate action is to simply ignore the changes on the server and overwrite them with the latest data from the user. If that was acceptable for your application you wouldn't be here worrying about concurrency. As mentioned briefly above the options are to compare and merge or discard and re-input.

With either approach, the difficulty to merge or re-input grows proportionally with both the time since the edit was made and the amount of data being committed. The difficulty lies not only in the user's ability to remember all the changes they have entered but also the application's ability to determine the concurrency relationships between the various data elements being committed (ie if an existing order conflicts due to a changed shipping address should the added order line be considered a conflict too?).

Use optimistic concurrency, save often, save automatically.

 Saturday, September 15, 2007

PowerShell Vulnerability

Some time ago I put together a template for a file that can be interpreted as both a batch command script for cmd.exe and a SQL script for SQL Server Management Studio and sqlcmd.exe. It works really well for enabling someone unfamiliar with SQL Server to deploy database update scripts.

I've been working with PowerShell quite a lot lately and I was wondering if it would be possible to have a file that would be interpreted by both cmd.exe and PowerShell.exe. It turns out that Jay Bazuzi, a developer at Microsoft, has already found an elegant solution. However, upon inspecting the code I became concerned how it would be effected by PowerShell's default setting to disable all scripts and especially disable unsigned scripts from remote sources.

I have since worked with Jay's sample on my home machine and discovered that it effectively bypasses PowerShell's anti-script security by piping the commands to it's interactive mode. As a result, it was relatively trivial to write a batch command script that would enable PowerShell to run all future scripts and modify the user's PowerShell profile to ensure it stays that way.

Thankfully, security in the OS and other applications is making it harder to get a command script on to another computer and get the user to run it but with the capabilities of PowerShell potentially available for malicious use it's just one more reason to not run as an administrator and minimise the damage.

 Friday, September 14, 2007

The Theory Of Thirds

It can be expected that for any blog about a difficulty experienced by a user of  Windows that the third comment to be posted will recommend the Linux or Mac OS as the solution to the problem. The first two comments preceding the third usually add actual value for other readers of the blog.

The theory is not restricted to just Windows and could equally be applied to a language (eg VB), a concept (eg Agile), a product (eg iPod), or any topic with distinct followers and opposers. Obviously, in these cases the suggested solution corresponds to the exchanged topic. For example, an article about VB's failings will be responded to with "You should use C# instead".