Post

SVG Violin Plot

Creating a SVG Violin to visualize distributions in Power BI

SVG Violin Plot

In my previous post I created a heat map SVG visual to visualize data distributions. In that post I mentioned Violin plot, this post describes how to create one.

Violin Plots

Firstly, what are Violin Plots? Violin plot show the distribution of data points, with the width of the curve estimating the density of points in a region. This allows the visualization multimodal data (more than one peak). These tend to be accompanied with a Box Plot to provide addition information and context. The curves of the Violin Plot are calculated using Kernel Density Estimation (KDE).

Violin Plot Wikipedia

Kernel Density Estimation (KDE)

In KDE data points are converted into kernels, where each point is represented by a distribution (normal, uniform, gaussian etc.). In the case of the normal distribution each point represents a mean (center of the curve), and the standard deviation can be used to expand the width of the distribution, allowing for smoothing. The data is sampled at uniform points across the range of the data, and the contribution each point is summed. I found this video to provide a good description.

Kernel Density Estimation (KDE) Wikipedia

Paths and Bézier Curves

Once we have calculated our KDE, we need to plot the curves. This is done using SVG paths, using Bézier Curves. Bézier Curves are defined by start and end point, plus 1 or more control points, which act like gravity pulling the curve towards them. Quadratic Bézier Curves have a single control point, Cubic have two, and so on.

Bézier Curves Link

In SVG paths the cubic curve is specified by C x1 y1, x2 y2, x y or a short-hand S x2 y2, x y which assumes the first control point is a reflection of the one used previously in the path.

End Result

Taking all the concept above, we can now generate a Violin Plot. The level of sampling and bandwidth will have to be adjusted to fit your data.

SVG Violin Plot

Command Duration Violin SVG =
VAR _SvgWidth = 150
VAR _SvgHeight = 20

VAR _samples = 50
VAR _bandwidth = 60 // Kernal width, adjust to avoid over or under fitting

VAR _range =
    MAXX(
        ALLSELECTED( ExecutionMetrics[XmlaRequestId] )
        ,CALCULATE(
            SUM( ExecutionMetrics[durationMs] ) / 1000
            ,ExecutionMetrics[LogAnalyticsCategory]= "Command"
            ,ALLSELECTED( Artifact )
        )
    )
VAR _rangePerSample = _range / _samples
VAR _values =
    ADDCOLUMNS(
        CALCULATETABLE( 
            VALUES( ExecutionMetrics[XmlaRequestId] )
            , not ISBLANK( ExecutionMetrics[durationMs] )
            , ExecutionMetrics[LogAnalyticsCategory]= "Command" 
        )
        ,"@Val", CALCULATE( SUM( ExecutionMetrics[durationMs] ) / 1000, ExecutionMetrics[LogAnalyticsCategory]= "Command" )
    )
VAR _numValues = COUNTAX( _values, [@Val] )
VAR _KDE = // Kernal Density Estimation
    ADDCOLUMNS(
        GENERATESERIES( 0, _samples + 1, 1 )
        ,"@inputX", _rangePerSample * [Value]
        ,"@KDE", ( 1 / _numValues ) * SUMX( _values, NORM.DIST( _rangePerSample * [Value], [@val], _bandwidth, false ) ) // Normal distribution for Kernal (Gaussian, Uniform, Triangular etc.)
    )
VAR _maxKDE = MAXX( _KDE, [@KDE] )
VAR _points =
    ADDCOLUMNS(
        _KDE
        ,"@x",
            VAR _xInputStart =           0                         // The lowest number of the range input
            VAR _xInputEnd =             _range                    // The largest number of the range input
            VAR _xOutputStart =          0                         // The lowest number of the range output
            VAR _xOutputEnd =            _SvgWidth                 // The largest number of the range output         
            VAR _xOutputVal =            _xOutputStart + ((_xOutputEnd - _xOutputStart) / (_xInputEnd - _xInputStart)) * ( [@inputX] - _xInputStart)
            RETURN
            _xOutputVal
        ,"@y",
            VAR _yInputStart =           0                         // The lowest number of the range input
            VAR _yInputEnd =             _maxKDE                   // The largest number of the range input
            VAR _yOutputStart =          _SvgHeight / 2            // The lowest number of the range output
            VAR _yOutputEnd =            0                         // The largest number of the range output         
            VAR _yOutputVal =            _yOutputStart + ((_yOutputEnd - _yOutputStart) / (_yInputEnd - _yInputStart)) * ([@KDE] - _yInputStart)
            RETURN
            _yOutputVal
    )
VAR _pointsAndPreviousPoint = 
    NATURALLEFTOUTERJOIN(
        _points
        ,SELECTCOLUMNS(
            _points
            ,"Value", [Value] + 1
            ,"@prevX", [@x]
            ,"@prevY", [@y]
        )
    )
VAR _plusControlPoint = // Calculate control point for cubic Bézier curve. Assumes they half way between points, at the same height as the current point
    ADDCOLUMNS(
        _pointsAndPreviousPoint
        ,"@cx", [@prevX] + (([@x] - [@prevX]) / 2)
        ,"@cy", [@y]
    )
VAR _topCurves =
    CONCATENATEX(
        _plusControlPoint
        ,"S " & [@cx] & " " & [@cy] & ", " & [@x] & " " & [@y] // cubic Bézier curve
        , " "
        , [Value]
   )
VAR _bottomCurves = // top curve inverted
    CONCATENATEX(
        _plusControlPoint
        ,"S " & [@cx] & " " & (_SvgHeight / 2) - ([@cy] - (_SvgHeight / 2)) & ", " & [@x] & " " & (_SvgHeight / 2) - ([@y] - (_SvgHeight / 2))
        , " "
        , [Value]
   )
VAR _svg =
    "data:image/svg+xml;utf8, <svg width=""" & _SvgWidth & """ height=""" & _SvgHeight &""" xmlns="http://www.w3.org/2000/svg">" &
    "<path d=' M 0 " & (_SvgHeight / 2) & " " & _topCurves & "M " & _SvgWidth & " " & (_SvgHeight / 2) & "' stroke='gray' fill='gray' fill-opacity='50%'/>" &
    "<path d=' M 0 " & (_SvgHeight / 2) & " " & _bottomCurves & "M " & _SvgWidth & " " & (_SvgHeight / 2) & "' stroke='gray' fill='gray' fill-opacity='50%'/>" &
    "</svg>"
RETURN
IF( not ISBLANK( _maxKDE ), _svg )

Conclusion

This visual is more well known than the previous heatmap, so might be more parsable by end users. This comes at the cost of some extra computation. In any case the generation of this visuals involves some interesting concepts that are worth finding more about.

This post is licensed under CC BY 4.0 by the author.