Charles Petzold



.NET Streams and Windows 8 IStreams

November 8, 2011
New York, N.Y.

I am currently engaged in writing an EPUB viewer for Windows 8. EPUB is a popular format for electronic books. The standard is maintained by the International Digital Publishing Forum (IDPF), and that's where you can find the documents that make up the EPUB specification. (I am basing my work on version 2.)

The specification has three parts: the Open Container Format (OCF), which describes how the book is assembled into a ZIP file; the Open Packaging Format (OPF), which describes the two required files in the package; and the Open Publication Structure (OPS), which describes the subset of XHTML used in EPUB documents. Anyone writing an EPUB viewer will also need to examine other specifications, in particular XHTML, HTML 4, and CSS 2.

An EPUB viewer first needs to open a ZIP file, and then extract a small XML file named META‑INF/container.xml. This file references an OPF file (also XML) that contains information about the book — title, author, subject, publisher, etc. — and a "manifest" that lists all the files that contribute to the book (mostly HTML files, CSS files, image files, font files), as well as a "spine" that indicates the reading order of the HTML files. The OPF file also references an NCX file, which is a table of contents for the book. Each chapter in the book has an HREF starting position, which is a particular HTML file with a possible ID appendage.

An EPUB viewer can take one of two strategies in rendering the HTML: It can make use of a web browser, or it can do all the work itself. It wasn't quite clear to me if that first option was possible in Windows 8 at this time, and besides, I thought it would be more fun parsing and rendering the HTML and CSS myself.

Except for the CSS files, all the text files in the EPUB package are XML, including the HTML files, which are actually XHTML. Thus, besides some way to open the ZIP file, the availability of an XmlReader class is very useful for an EPUB viewer.

Fortunately, Windows 8 has both. The System.IO.Compression namespace includes a ZipArchive class, with a constructor that takes a Stream object. ZipArchive has an Entries property that lists all the files in the archive. More useful for accessing EPUBs is GetEntry, which returns a ZipArchiveEntry object for a particular file path within the archive. The ZipArchiveEntry class defines an Open method that returns a Stream object.
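Here is a minimal sketch of that sequence: construct a ZipArchive from a Stream, look up one entry with GetEntry, and read the Stream that Open returns. The class and method names (ZipSketch, ReadEntryText) are made up for illustration, and the sketch assumes the entry holds text.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, for illustration only.
public static class ZipSketch
{
    // Returns the text of one entry in a ZIP archive, or null if
    // the archive has no entry with that path.
    public static string ReadEntryText(Stream zipStream, string entryPath)
    {
        // Disposing the archive also disposes zipStream.
        using (ZipArchive archive = new ZipArchive(zipStream, ZipArchiveMode.Read))
        {
            // GetEntry returns null when the path is not in the archive.
            ZipArchiveEntry entry = archive.GetEntry(entryPath);
            if (entry == null)
                return null;

            // Open returns a Stream over the decompressed entry contents.
            using (Stream entryStream = entry.Open())
            using (StreamReader reader = new StreamReader(entryStream))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Because the method takes any Stream, it does not care whether the ZIP data came from a disk file or from memory.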

The Windows 8 System.Xml namespace includes the familiar XmlReader class with a static Create method that lets you create an XmlReader object based on either a Stream or a TextReader.
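As a rough sketch of how XmlReader fits the first EPUB step, the following method scans a container.xml stream for the OPF file path. It assumes the layout the OCF specification describes, in which a rootfile element carries the path in a full-path attribute. The names ContainerSketch and FindOpfPath are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class ContainerSketch
{
    // Returns the full-path attribute of the first <rootfile> element,
    // or null if no such element (or attribute) is found.
    public static string FindOpfPath(Stream containerXml)
    {
        using (XmlReader reader = XmlReader.Create(containerXml))
        {
            while (reader.Read())
            {
                // Compare LocalName so the container namespace is ignored.
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.LocalName == "rootfile")
                {
                    return reader.GetAttribute("full-path");
                }
            }
        }
        return null;
    }
}
```

A forward-only reader suits this job: container.xml is tiny, and the scan can stop at the first rootfile element.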

So far, so good. But the novice Windows 8 programmer's optimism might start to fade with a little glimpse into the System.IO namespace. The Windows 8 version of this namespace has been stripped of everything involving the file system. It is missing FileSystemInfo, File, FileInfo, Directory, DirectoryInfo, Path, and FileStream. The System.IO namespace still has a Stream class, but the only thing that derives from Stream is MemoryStream.

As I discussed in yesterday's blog entry Asynchronous Processing in Windows 8, a Windows 8 program references a disk file with a StorageFile object (defined in the Windows.Storage namespace). StorageFile has methods named OpenAsync and OpenForReadAsync, but these methods don't provide Stream objects. They provide IRandomAccessStream and IInputStream objects, respectively.

These Windows 8 stream interfaces are defined in the Windows.Storage.Streams namespace. IRandomAccessStream has a Size property and defines two methods, GetInputStreamAt and GetOutputStreamAt, which return IInputStream and IOutputStream objects, respectively.

The IInputStream interface defines just one method, ReadAsync, which lets you read bytes into an IBuffer object. Windows.Storage.Streams also includes a DataReader class that you create based on an IInputStream object and then use to read numerous .NET objects from the file, as well as arrays of bytes.

In short, it's very clear how you can get from a StorageFile to a DataReader to read the contents of a disk file. However, if you need to use ZipArchive or XmlReader, you need a .NET Stream object, and how you get from a StorageFile to a Stream object is not so clear.

I ran into a related problem when trying to use the Windows 8 version of WriteableBitmap, and I described the solution in my blog entry SpinPaint for Windows 8. The PixelBuffer property of WriteableBitmap is an IBuffer, and I was able to use an extension method named AsStream, defined in the System.Runtime.InteropServices.WindowsRuntime namespace, to convert the IBuffer into a .NET Stream object. But I still couldn't quite find the right components to get the current task to work.

It slowly dawned on me why all the file-system stuff has been removed from the Windows 8 System.IO namespace: These classes access the file system, so it's likely that they would require more than 50 milliseconds to complete. Hence, they violate the "fast and fluid" rule. The Windows.Storage.* namespaces contain more modern classes for using the file system, and they are asynchronous when necessary.

I mentioned that you can create a DataReader from an IInputStream object. But after you create the DataReader, you can't just call ReadBytes on it to read a chunk of the file into a buffer. It won't work. You'll get back zero bytes. Why is that? The file is still on the disk, and ReadBytes doesn't read it from the disk into memory, because that might well require more than 50 milliseconds. The method is named ReadBytes rather than ReadBytesAsync: It's not an asynchronous method, which means it's not hitting the disk.

What you need to do first with the DataReader is call LoadAsync, which has an argument indicating a number of bytes that you want to transfer from the disk into memory. This method is asynchronous, and after it returns you can then call ReadBytes.

So here's a little method I wrote in a static class called Helpers:

// Reads the entire contents of a StorageFile into a byte array.
public static Task<byte[]> ReadStorageFileAsync(StorageFile storageFile)
{
    return Task.Run<byte[]>(async () =>
    {
        ulong size = storageFile.Size;
        IInputStream inputStream = await storageFile.OpenForReadAsync();
        DataReader dataReader = new DataReader(inputStream);

        // LoadAsync transfers the bytes from disk into memory;
        // only then can ReadBytes copy them into the array.
        await dataReader.LoadAsync((uint)size);
        byte[] buffer = new byte[(int)size];
        dataReader.ReadBytes(buffer);
        return buffer;
    });
}

Notice that this method is asynchronous, and it calls two other asynchronous methods. You can call it like so:

byte[] buffer = await Helpers.ReadStorageFileAsync(storageFile);

From the array of bytes, you can create a .NET MemoryStream and pass that to the ZipArchive constructor or the XmlReader.Create method.
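To make that last step concrete, here is a small sketch that goes from a byte array to a MemoryStream, then to a ZipArchive, and finally to an XmlReader over one entry. It only reports the name of the entry's root element, which is enough to show the plumbing. EpubBytesSketch and ReadRootElementName are invented names, not part of New Epublic.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class EpubBytesSketch
{
    // Opens the ZIP data held in memory, finds one XML entry, and returns
    // the local name of its root element (or null if the entry is missing).
    public static string ReadRootElementName(byte[] zipBytes, string entryPath)
    {
        using (MemoryStream memoryStream = new MemoryStream(zipBytes))
        using (ZipArchive archive = new ZipArchive(memoryStream, ZipArchiveMode.Read))
        {
            ZipArchiveEntry entry = archive.GetEntry(entryPath);
            if (entry == null)
                return null;

            using (Stream entryStream = entry.Open())
            using (XmlReader reader = XmlReader.Create(entryStream))
            {
                // Skip the XML declaration, comments, and whitespace.
                reader.MoveToContent();
                return reader.LocalName;
            }
        }
    }
}
```

Everything after the byte array is synchronous, which is reasonable here because the data is already in memory.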

So far, my New Epublic program stops short of accessing the HTML files included in the package. The MainPage is a "library" view. To add EPUBs to the library, you press the "Add EPUB to Library" button on the application bar. This invokes a FileOpenPicker that lets you navigate around My Documents and lists any .EPUB files you might have. (Free EPUBs are available from Project Gutenberg, epubBooks.com, Web Books Publishing, and other sources.) When you select one, New Epublic opens the ZIP file to extract the title, author, and other information, and saves that together with a reference to the filename in persistent application settings. The library is a GridView. This screen shot is half size:

Yes, I know, it's not very pretty yet. Notice my book Programming Windows Phone 7 among these. You can get that free EPUB from a link on a Microsoft Press blog.

Click any book to go into a reading view, except that so far the reading view only displays a scrollable list of chapters with (temporary) HTML references:

This view requires that the original EPUB file be accessed again, this time with a call to the static (and asynchronous) StorageFile.GetFileFromPathAsync. I discovered that while my program was allowed to load a file from the FileOpenPicker, it couldn't load that same file later with StorageFile.GetFileFromPathAsync. The solution was to change the program's capabilities stored in the Package.appxmanifest. I had to check the Document Library Access button, at which point I had to define the type of files I wanted with File Type Associations.

Doing this made my program associated with EPUB files, and the program seemed to be activated by selecting an EPUB from Windows Explorer or Internet Explorer — as long as the program was already running. Obviously I'll be exploring this feature in the future.

Meanwhile, if you're interested in looking at messy source code, here's the New Epublic project as it stands today.

Next up on the agenda: Start parsing and rendering those HTML files.