Tuesday, April 25, 2017

Excel Power Query and CSV files

Power Query (or Get and Transform as it is called now) has some interesting tools to import a bunch of CSV files. Here are some notes.

Line Endings

Here is a small set of CSV files:

image

When we load this with New Query|From File|From Folder and Edit we see:

image

Pressing the button with the arrows in the Content column leads to:

image

The reason for this behavior is: the last line of each CSV file does not have a proper line ending. Most CSV readers are forgiving in this respect, Excel (in this case) is not.

image

The auto-generated script (shown above) is combining the content of the CSV files using the Binary.Combine function. I don’t know an easy way to fix this in the generated import script so instead I fixed the CSV files themselves, so we can concentrate on the handling of headers.

Headers

If we have headers in the CSV files, as in:

image

we can promote the first record of the combined data to form our column names. Unfortunately, there will be remaining headers in the records:

image

These header records can be filtered out explicitly. After applying the row filter and fixing up the column types, we have:

image

I changed the types of column ID, Population and Year to “whole number”. Note: it is not always clear if an ID should be a numeric value. There are good reasons to keep it text so we cannot apply numeric operations on this column.

The generated script looks like:

let
    Source = Folder.Files("D:\tmp\citypopulation"),
    #"Combined Binaries" = Binary.Combine(Source[Content]),
    #"Imported CSV" = Csv.Document(#"Combined Binaries",[Delimiter=",", Columns=4, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV"),
    #"Filtered Rows" = Table.SelectRows(#"Promoted Headers", each [ID] <> "ID"),
    #"Changed Type" = Table.TransformColumnTypes(#"Filtered Rows",{{"Population", Int64.Type}, {"Year", Int64.Type}, {"ID", Int64.Type}})
in
    #"Changed Type"

 

Filename as column

In this case we also want to the country name as extra column in the final table. This name can be deduced from the file name. This type of operation is not unusual when working with CSV files. The following script will do the job:

let
    Source = Folder.Files("D:\tmp\citypopulation"),
    #"Added Column 1" = Table.AddColumn(Source, "CountryName", each Text.Replace([Name],[Extension],"")),
    #"Added Column 2" = Table.AddColumn(#"Added Column 1", "Data", each Table.PromoteHeaders(Csv.Document([Content], [Delimiter=",", Columns=4, Encoding=1252, QuoteStyle=QuoteStyle.None]))),
    #"Select Columns" = Table.SelectColumns(#"Added Column 2",{"CountryName", "Data"}),
    Expanded = Table.ExpandTableColumn(#"Select Columns", "Data", {"ID", "City", "Population", "Year"}, {"ID", "City", "Population", "Year"}),
    #"Changed Type" = Table.TransformColumnTypes(Expanded,{{"ID", Int64.Type}, {"Population", Int64.Type}, {"Year", Int64.Type}})   
in
    #"Changed Type"

Basically the algorithm is:

  1. Source: Read directory information.
  2. Added Column 1: add column “CountryName” by dropping the extension from the filename (i.e. “china.csv” becomes “china”).
  3. Added Column 2: add a column “Data” that will contain the content of each CSV file (a so-called table column).
  4. Select Columns: keep only these two columns.
  5. Expanded: Expand the “Data” table column so that columns inside the “Data” tables become columns of our main table. This is a pretty powerful function.
  6. Changed Type: fix up the column types.

This results in:

image

With this script we actually no longer need to worry about line endings of the last record in each file: the Csv.Document function (inside the “Added Column 2” step) will take care of that.

Extract only *.CSV files from directory

If the directory contains other files than *.csv files we can apply a row filter on the source table we get from the Folder.Files function. Let’s also make sure we handle both *.csv and *.CSV files:

let
    Source = Folder.Files("D:\tmp\citypopulation"),
    CsvFiles = Table.SelectRows(Source, each Text.Lower([Extension]) = ".csv"),
    . . .

or if we don’t want to introduce an extra step:

let
    Source =
Table.SelectRows(Folder.Files("D:\tmp\citypopulation"), each Text.Lower([Extension]) = ".csv"),
    . . .

Note that it is not allowed to use something like: Folder.Files("D:\tmp\citypopulation\*.csv").

The M Language

To work with the scripts it is useful to know a little bit about the M language, see (3) and (4). Microsoft is very good in creating texts with a somewhat high PR level. The next sentences are from (4):

The Power Query Formula Language is a powerful query language optimized for building queries that mashup data. It's a functional, case sensitive language similar to F#, which can be used with Power Query in Excel, Get & Transform in Excel 2016, and Power BI Desktop.

For computer language theorists: Power Query is a mostly pure, higher-order, dynamically typed, partially lazy, functional language. You can create a variety of data mashup queries from simple to advanced. As with any programming language, you will need some intermediate programming experience for more advanced scenarios. 

References

  1. Data is from https://www.citypopulation.de/
  2. Imke Feldmann, Power BI: Open multiple CSV files in a folder at once with transactions in Power Query, https://www.youtube.com/watch?v=whu1CsoZn1Q&feature=youtu.be for a slightly different approach regarding the last script
  3. Microsoft, Power Query M Reference, https://msdn.microsoft.com/en-us/library/mt211003.aspx
  4. Microsoft, Introduction to Power Query (informally known as "M") Formula Language, https://msdn.microsoft.com/en-us/library/mt270235.aspx

No comments:

Post a Comment