.net

Investigating corrupt pdf

An interesting support ticket hit the bug tracker today: a user claimed that some of our software is corrupting *.pdf files.

To give a bit of context, the blamed software is a pickup service that collects all the *.pdf files from a given directory and, based on some business rules, dispatches them by email. Afterwards, it moves them to an archive directory.

Initially, I was tempted to blame the antivirus software, as we previously had issues with the corporate antivirus mangling emails and other such unpleasantries. However, I decided to investigate the issue all the way and draw conclusions afterwards.

A phone call with the user reporting the issue revealed the following:

  • this has been an issue from the first day the software was deployed into production <- this ruled out the possibility of being tied to a newer version of the pickup service (no need for me to check the recent commit history in git, as I won’t find anything there)
  • not all the pdf files get corrupted (only around 10%) <- this means it’s not a general bug in the code, but on the contrary it’s a particular case (special characters maybe?)
  • a corrupted pdf file was emailed to me
  • full user scenario: the user opens a pdf, applies a digital signature, then saves the pdf in the directory monitored by the pickup service

Armed with the above, I went into the code that handles the pickup and navigated all the way to the place where the attachments are being specified and the email gets sent. Nothing suspicious, everything looked fine and my senses could not detect any code smell that could lead to the attached file being corrupted.

The code looked good and at this point I had no theory about what was going wrong, so I went fishing… I searched the archive directory (the place where the service puts the pdfs that have been successfully processed) and managed to find the working (non-corrupted) version of the pdf file that was emailed to me.

Having the two files, I could now open them in a text editor and compare their contents. The textual comparison revealed only small differences at the beginning (header) and the end (trailer) of the file [see the pdf file structure here]. While I find it normal to maybe have differences in the header, it seemed weird to have differences in the trailer. Looking further into the textual representation of the pdf revealed the “/Name /PDF-XChange-Pro” sequence, and the fact that the last line of the working pdf file is “%%EOF”, while the corrupted pdf has “stream” as its last line. My mind began the blame game: “maybe this PDF-XChange Pro program that they use is flawed and errors out from time to time when it saves the pdf to the disk”. But it didn’t make sense: if you are a piece of software that creates/edits pdf files and you cannot save them to the disk, you should not exist at all, let alone have a Pro version. Therefore, the whole idea of PDF-XChange having a bug when persisting the pdf seemed unrealistic…

I took a step back and realized that the pickup service runs at fixed time intervals: every 10 seconds it checks the contents of the pickup directory and processes the existing pdf files.

What if the pickup service tries to attach a pdf while PDF-XChange hasn’t finished writing it to the disk? This would cause the file to look corrupt, and would of course cause it not to have “%%EOF” as the last line in the file.
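The “%%EOF” observation can be turned into a quick sanity check. The following is only a sketch: the PdfTail class and the EndsWithEofMarker method are hypothetical names that are not part of the pickup service, and the check only looks at the last few bytes of the file, so it can flag an obviously truncated pdf but cannot prove that a file is valid.

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfTail
{
    // Hypothetical helper: returns true when the file ends with the "%%EOF" marker,
    // ignoring trailing whitespace such as the final line break.
    public static bool EndsWithEofMarker(string path)
    {
        // FileShare.ReadWrite lets us peek even while another process still holds the file open.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            // Only the tail of the file matters for this check.
            int tailLength = (int)Math.Min(fs.Length, 64);
            fs.Seek(-tailLength, SeekOrigin.End);

            var tail = new byte[tailLength];
            int read = 0;
            while (read < tailLength)
            {
                int n = fs.Read(tail, read, tailLength - read);
                if (n == 0) break; // the file shrank while we were reading
                read += n;
            }

            string text = Encoding.ASCII.GetString(tail, 0, read).TrimEnd();
            return text.EndsWith("%%EOF", StringComparison.Ordinal);
        }
    }
}
```

A file whose last line is “stream”, like the corrupted one above, makes this method return false.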

The theory sounds plausible and an investigation is due. So, I aimed the mighty Process Monitor at the process that serves the PDF-XChange application and tried to write a pdf file to the disk. Got the following:

[Screenshot: PDF-XChange – Disk activity]

When PDF-XChange opens the file to write the contents of the pdf, it specifies ShareMode: Read. This means any other process that wants to read the file while PDF-XChange writes to it is more than welcome to do so. Cool guy PDF-XChange, trying to make everybody happy; however, this probably causes the pickup service to read incomplete pdfs.

There is only one more step left before calling this case closed: checking how the pdfs are attached to emails by the pickup service, to see whether any unexpected file access modes are specified there. As expected, attaching the files to the email is done by a call to File.ReadAllBytes, which specifies FileAccess.Read as the access mode. This is compatible with the sharing mode specified by PDF-XChange, hence the incomplete pdfs.

        [System.Security.SecuritySafeCritical]  // auto-generated
        [ResourceExposure(ResourceScope.Machine)] 
        [ResourceConsumption(ResourceScope.Machine)]
        public static byte[] ReadAllBytes(String path)
        {
            byte[] bytes; 
            using(FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read)) {
                // Do a blocking read 
                int index = 0; 
                long fileLength = fs.Length;
                if (fileLength > Int32.MaxValue) 
                    throw new IOException(Environment.GetResourceString("IO.IO_FileTooLong2GB"));
                int count = (int) fileLength;
                bytes = new byte[count];
                while(count > 0) { 
                    int n = fs.Read(bytes, index, count);
                    if (n == 0) 
                        __Error.EndOfFile(); 
                    index += n;
                    count -= n; 
                }
            }
            return bytes;
        } 
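One possible guard, sketched under the assumption that the writing application keeps its handle open for the whole duration of the save, is to try opening the file with FileShare.None before picking it up. On Windows such an open fails with an IOException while any other process still has the file open. PickupFileGuard and IsReadyForPickup are hypothetical names, not part of the actual pickup service.

```csharp
using System;
using System.IO;

public static class PickupFileGuard
{
    // Hypothetical helper: returns true only when the file can be opened exclusively,
    // which suggests that no other process is still reading or writing it.
    public static bool IsReadyForPickup(string path)
    {
        try
        {
            using (new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None))
            {
                return true;
            }
        }
        catch (IOException)
        {
            // Sharing violation, missing file, or another I/O failure:
            // skip the file and let the next polling cycle retry it.
            return false;
        }
    }
}
```

With a check like this, the service would simply leave a file alone until a later 10 second cycle finds it unlocked.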

In conclusion, even though sometimes overlooked, attention should be paid to the FileShare parameter, as it plays a key role when it comes to sharing files with other processes, or when you want to open a file multiple times.

To further clarify this issue, I will perform an analysis of the possible FileShare modes in a future post.

Testing objects for equality or similarity

There are times when you need to test the equality or similarity of two objects. Most of the time you encounter this need while unit testing, when you craft some input data, call a method to do some work on it, and assert on the result.

Presenting Deep Equal https://github.com/jamesfoster/DeepEqual

DeepEqual does exactly what it says it does: it compares objects for equality, while supporting nested objects, skipping specific properties, skipping extra properties, and being open for extensibility (via implementing its IComparison interface).

Checking for equality

var a = new { SomeProperty = "some property value" };
var b = new { SomeProperty = "some property value" };

var areEqual = a.IsDeepEqual(b);

The areEqual variable will hold true, as both a and b have only one property, and the value of that property is the same for both.

Ignoring extra properties

var a = new { SomeProperty = "some property value" };
var b = new { SomeProperty = "some property value", SomeOtherProperty = "some other property value" };

var areEqual = a.IsDeepEqual(b);

The areEqual variable will hold false, as b has an extra property.

You can configure DeepEqual to ignore extra properties by calling

a.WithDeepEqual(b).IgnoreUnmatchedProperties().Compare();

Ignoring specific properties

Moreover, you can decide to ignore a property with one or both of the following constructs:

a.WithDeepEqual(b).IgnoreSourceProperty(x => x.SomeProperty).Assert();
a.WithDeepEqual(b).IgnoreDestinationProperty(x => x.SomeProperty).Assert();

Custom comparisons

DeepEqual can be extended by implementing IComparison.

Here’s an example where case insensitive ordinal string comparison is desired for all string types.

class CustomStringComparison : IComparison
{
  private readonly StringComparison _comparisonType;

  public CustomStringComparison(StringComparison comparisonType)
  {
    _comparisonType = comparisonType;
  }

  public bool CanCompare(Type type1, Type type2)
  {
    return type1 == typeof(string) && type2 == typeof(string);
  }

  public ComparisonResult Compare(IComparisonContext context, object value1, object value2)
  {
    var value1Str = (string)value1;
    var value2Str = (string)value2;

    if (string.Equals(value1Str, value2Str, _comparisonType))
      return ComparisonResult.Pass;

    context.AddDifference(new Difference
    {
      Breadcrumb = context.Breadcrumb,
      Value1 = value1,
      Value2 = value2
    });

    return ComparisonResult.Fail;
  }
}

var a = new { SomeProperty = "some property value" };
var b = new { SomeProperty = "SOME property value" };

a.WithDeepEqual(b).WithCustomComparison(new CustomStringComparison(StringComparison.OrdinalIgnoreCase)).Assert();

In the above scenario all the string values will be piped into the CustomStringComparison, which in turn will perform the configured string comparison. However, sometimes you may want to employ a different comparison behavior based on the actual property you are checking. For instance, all the properties named “MyProperty” should be checked in a case-sensitive manner, while for all the others the comparison should be case insensitive. You can achieve this by checking the IComparisonContext.Breadcrumb property, which depicts the path to the currently compared value.

Moreover, if your IComparison implementation cannot tell if two values are equal, it can just return ComparisonResult.Inconclusive and DeepEqual will continue probing the other registered IComparison implementations (including its own).


Task Parallel Library – TaskScheduler, Threads, and Deadlocks

Task Parallel Library can make your code perform faster, look nicer (aka easier to review), and be easier to maintain (aka fix bugs and extend). However, there are a few things you need to be aware of while using it; otherwise you may end up with a strangely behaving application, or worse, you will find yourself forever waiting for a task to be executed <– not cool, not cool at all.

The mighty TaskScheduler

One of the most important notions you need to grasp when using tasks is TaskScheduler. The TaskScheduler represents a mechanism that is responsible for scheduling tasks for execution.

At the time of writing this post, there are 3 TaskSchedulers in .NET Framework 4.5:

  • ThreadPoolTaskScheduler

    • uses the ThreadPool to schedule tasks for execution
    • the scheduler by itself is not very smart, nor does it do a lot; however, the ThreadPool is extremely smart and fast. A few notable features: a lock-free mechanism for storing/retrieving user work items, and work stealing (threads from the thread pool will steal work from each other when they have nothing to do) – you can read more here
  • SynchronizationContextTaskScheduler

    • uses the current synchronization context (SynchronizationContext.Current) and posts the task to it – you can read more here
    • If called from the UI thread of a Silverlight, Windows Presentation Foundation, or WinForms application, SynchronizationContext.Current will return a synchronization context that always executes work on the UI thread
    • used by calling TaskScheduler.FromCurrentSynchronizationContext()
    • VERY IMPORTANT – there are multiple implementations of SynchronizationContext, and it is not always safe to assume that a synchronization context is bound to only one thread. Therefore, when using TaskScheduler.FromCurrentSynchronizationContext() you need to be on the UI thread; this is the only way to guarantee that you get back a task scheduler that schedules tasks on the UI thread
  • ConcurrentExclusiveTaskScheduler

    • Is a hybrid that operates in two modes
      • ProcessingExclusiveTask – processes the scheduled tasks in an exclusive manner (one at a time, no two tasks run at the same time)
      • ProcessingConcurrentTasks – processes the scheduled tasks in a concurrent manner, and allows to specify a max concurrency level (how many tasks can run at the same time)
    • Used internally by ConcurrentExclusiveSchedulerPair
    • You can read more here
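The two modes of the last scheduler can be observed through the public ConcurrentExclusiveSchedulerPair class. The sketch below is illustrative only: SchedulerPairDemo and MeasureMaxConcurrency are hypothetical names, and the Thread.Sleep call merely simulates work so that tasks have a chance to overlap.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SchedulerPairDemo
{
    // Hypothetical helper: runs taskCount short tasks on the given scheduler and returns
    // the highest number of tasks that were observed running at the same time.
    public static int MeasureMaxConcurrency(TaskScheduler scheduler, int taskCount)
    {
        var gate = new object();
        int running = 0;
        int maxSeen = 0;

        var tasks = new Task[taskCount];
        for (int i = 0; i < taskCount; i++)
        {
            tasks[i] = Task.Factory.StartNew(() =>
            {
                lock (gate)
                {
                    running++;
                    if (running > maxSeen) maxSeen = running;
                }

                Thread.Sleep(10); // simulated work

                lock (gate) { running--; }
            }, CancellationToken.None, TaskCreationOptions.None, scheduler);
        }

        Task.WaitAll(tasks);
        return maxSeen;
    }
}
```

Calling MeasureMaxConcurrency with the ExclusiveScheduler of a new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, 2) reports 1, while the ConcurrentScheduler of the same pair never reports more than the configured level of 2.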

 

While creating tasks and continuations you are very likely to stumble upon two static properties of the TaskScheduler:

  1. TaskScheduler.Default
    • Returns an instance of the ThreadPoolTaskScheduler
  2. TaskScheduler.Current
    • If called from within an executing task will return the TaskScheduler of the currently executing task
    • If called from any other place will return TaskScheduler.Default

 

Threads

Depending on how you create and run your tasks, different schedulers will be used, and thus different threading behavior will be exhibited.

private void InstantiateAndStart()
{
  var task = new Task(() => { });
  task.Start(); 
  //will use TaskScheduler.Current to schedule the task for execution
}

private void UsingTheTaskFactory()
{
  Task.Factory.StartNew(() => { }); 
  //will use TaskScheduler.Current to schedule the task for execution
}

private void UsingTaskRun()
{
  Task.Run(() => { });
  //will use TaskScheduler.Default to schedule the task for execution
}

private void ExecutingOnANewThread()
{
  //both tasks defined below will be executed on a new background thread
  var task = new Task(() => { }, TaskCreationOptions.LongRunning);
  task.Start(TaskScheduler.Default);

  //or

  var task2 = Task.Factory.StartNew(() => { }, CancellationToken.None, TaskCreationOptions.LongRunning, TaskScheduler.Default);
}

Important facts

  • starting a new task does not necessarily spawn a brand new thread. In certain conditions it will execute on a background thread, and in others it will execute on the same thread that started it. It all depends on the TaskScheduler that gets used to schedule the task.
  • if you want your task to execute on a brand new thread, use the ExecutingOnANewThread example above, and keep in mind that it is up to the task scheduler to decide what to do with your TaskCreationOptions.LongRunning. The ThreadPoolTaskScheduler will always create a new background thread when presented with this task creation option; however, other task schedulers may exhibit a different behavior.
  • the constructs that use the ambient TaskScheduler.Current can easily become tricky. Therefore it is recommended to use the UsingTaskRun example when you want your task to execute asynchronously, and to always specify your TaskScheduler when using the other constructs, as it makes your code more explicit and makes it less likely for people to get in trouble when working with it.
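The difference between a regular task and a LongRunning one on TaskScheduler.Default can be checked with a small sketch. ThreadKindDemo and RanOnThreadPool are hypothetical names used only for this illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ThreadKindDemo
{
    // Hypothetical helper: starts a task on TaskScheduler.Default with the given options
    // and reports whether its body executed on a thread-pool thread.
    public static bool RanOnThreadPool(TaskCreationOptions options)
    {
        Task<bool> task = Task.Factory.StartNew<bool>(
            () => Thread.CurrentThread.IsThreadPoolThread,
            CancellationToken.None,
            options,
            TaskScheduler.Default); // scheduler named explicitly, as recommended above

        return task.Result;
    }
}
```

RanOnThreadPool(TaskCreationOptions.None) is expected to return true, while RanOnThreadPool(TaskCreationOptions.LongRunning) is expected to return false, because the ThreadPoolTaskScheduler gives such a task a dedicated thread.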

Deadlocks and unexpected behaviors

Trying to do something on a background thread and ending up executing on the UI thread

Take the following example

Task.Factory.StartNew(() => PerformSlowOperation())
.ContinueWith(loadTask =>
{
   //load the loadTask.Result in the UI
   //now you decide you want to do some more slow operations and start a new task

   Task.Factory.StartNew(() => PerformSlowOperation());
}, TaskScheduler.FromCurrentSynchronizationContext());

The example illustrates a simple scenario in which a task is started to perform a slow operation, then a continuation is hooked up to the task to print the results to the screen (hence the usage of TaskScheduler.FromCurrentSynchronizationContext()), and afterwards a new slow operation is triggered.

Q: On which thread will the second invocation of the PerformSlowOperation execute, background or UI thread?

A: UI thread

Why:

  1. Task.Factory.StartNew uses TaskScheduler.Current. On the first invocation of PerformSlowOperation no task was executing, so TaskScheduler.Current fell back to TaskScheduler.Default => execution on a background thread
  2. The continuation was executed on the UI thread, because it was requested by specifying the synchronization context specific TaskScheduler
  3. When PerformSlowOperation was called the second time, the TaskScheduler.Current was no longer undefined, but it was actually pointing to the TaskScheduler of the continuation, and the TaskScheduler of the continuation schedules work to be executed on the UI thread.

Fix:

  • Use a task construct that can take a task scheduler when executing PerformSlowOperation the second time
  • Specify TaskContinuationOptions.HideScheduler on the continuation, so that tasks started from inside it see TaskScheduler.Default as the current scheduler
  • etc
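The way a nested task picks up the ambient scheduler can be reproduced without a UI. The sketch below is illustrative only: AmbientSchedulerDemo and its methods are hypothetical names, and any non-default scheduler (for example the ExclusiveScheduler of a ConcurrentExclusiveSchedulerPair) can stand in for the UI scheduler. Note that HideScheduler is applied to the outer task, the one from which the nested task is started.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AmbientSchedulerDemo
{
    // Starts an outer task on 'ambient', then starts a nested task from inside it without
    // naming a scheduler. Returns true when the nested task executed on 'ambient'.
    public static bool NestedStartNewRunsOnAmbient(TaskScheduler ambient, TaskCreationOptions outerOptions)
    {
        Task<Task<bool>> outer = Task.Factory.StartNew<Task<bool>>(
            () => Task.Factory.StartNew<bool>(
                () => ReferenceEquals(TaskScheduler.Current, ambient)),
            CancellationToken.None, outerOptions, ambient);

        // Wait from the calling thread instead of blocking inside the outer task.
        return outer.Unwrap().Result;
    }

    // Same shape, but the nested task is started with Task.Run,
    // which targets TaskScheduler.Default.
    public static bool NestedTaskRunRunsOnAmbient(TaskScheduler ambient)
    {
        Task<Task<bool>> outer = Task.Factory.StartNew<Task<bool>>(
            () => Task.Run<bool>(
                () => ReferenceEquals(TaskScheduler.Current, ambient)),
            CancellationToken.None, TaskCreationOptions.None, ambient);

        return outer.Unwrap().Result;
    }
}
```

With TaskCreationOptions.None on the outer task, the nested Task.Factory.StartNew call lands on the ambient scheduler. With TaskCreationOptions.HideScheduler on the outer task, or with Task.Run for the nested task, it does not.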

Deadlocking

Consider the following case

new Task(() =>
{
   //executing on the UI thread

   //starting a new task and waiting for it
   var otherTask = Task.Factory.StartNew(() => PerformSlowOperation());
   otherTask.Wait();
}).Start(TaskScheduler.FromCurrentSynchronizationContext());

The above example illustrates a scenario in which a task is executing on the UI thread, then decides to spawn a new task to do some work and waits for it to finish.

Q: What will happen when the above code gets run?

A: The application will hang forever

Why: Thanks to TaskScheduler.Current the invocation of the PerformSlowOperation method will be scheduled on the UI thread, but it will never get the chance to start, because the current task, which also executes on the UI thread, is blocking the thread waiting for it to finish executing.

Fix:

  • Do not wait on the task, set up a continuation instead
  • Use a task construct that can take a task scheduler for otherTask
  • Use TaskCreationOptions.HideScheduler on the outer task, so that otherTask sees TaskScheduler.Default as the current scheduler
  • etc

The above is by no means an exhaustive list of what can go wrong when using tasks. Still, the benefits outweigh the potential pitfalls, and one should not be afraid of using tasks. If you find yourself in a tricky situation, remember that you are not alone: Visual Studio includes a task visualizer that can help you understand what’s happening.

Visual Studio Task Visualizer

 

The above screenshot of the Task Visualizer depicts the deadlocking example. You can observe the first task executing on the main thread, and the second task being just scheduled for execution – too bad it will never get executed. But the Task Visualizer doesn’t necessarily indicate that we are in a deadlock situation; it is still up to the developer to switch to each task, understand how they relate to each other, and identify the deadlock.