Quantcast
Channel: VBForums - CodeBank - Visual Basic 6 and earlier
Viewing all articles
Browse latest Browse all 1513

Simple Statistics

$
0
0
I've started a similar CodeBank thread before, but I'm now thinking I went too complex, as there was no interest. Just looking around earlier today, I saw a request under a CodeBank entry by The Trick. I didn't address all the requests in that entry, but I did address some of them. Maybe, if this has some interest, I'll develop some quartile/percentile functions as well as others.

Basically, I've just provided some one-sample statistical functions. I've also made a decision on how to handle missing values. I've struggled with this in VB6. One option is certainly the use of Variant. However, I've never been terribly happy with that option. Therefore, I've decided on sticking with Double arrays for my data, and using the IEEE Double NaN value to denote missing values. This can be seen in the code.

Now, for the uninitiated, NaN values can be a bit tricky. They're somewhat similar to the Null value, but even more restrictive. Once you get a NaN, you can continue to do math with it, but the results will be NaN (similar to Null in Variants). However, you can't do Boolean comparisons with a NaN. In other words, they'll crash if used in an If statement. Therefore, anyone using these functions, needs to develop a practice of checking return values with the IsNan() function. This will keep you out of trouble.

Now, most of what I did today is straight-forward. However, I did dip into calculating a p-value (and confidence intervals), which requires "distributions". I've leaned on the ALGLIB project to derive my PDF (probability distribution function [not portable document format]) and CDF (cumulative distrubution function) values.

The first part doesn't require this though. I've attached a complete project. All is tested, but I didn't really develop much of an interface. If you're interested, focus first on the modSimpleStats module. Here's the part of that module that doesn't use distributions. It's stand-alone:

Code:

Option Explicit
'
Private Declare Sub GetMem1 Lib "msvbvm60.dll" (ByRef Source As Any, ByRef Dest As Any)
Private Declare Sub GetMem2 Lib "msvbvm60.dll" (ByRef Source As Any, ByRef Dest As Any)
Private Declare Sub GetMem4 Lib "msvbvm60.dll" (ByRef Source As Any, ByRef Dest As Any)
Private Declare Sub GetMem8 Lib "msvbvm60.dll" (ByRef Source As Any, ByRef Dest As Any)
Private Declare Function ArrPtr Lib "msvbvm60.dll" Alias "VarPtr" (Ptr() As Any) As Long
'

' *******************************************
' *******************************************
'
' We start with some "helper" functions.
'
' *******************************************
' *******************************************

Public Function NaN() As Double
    ' Math can be done on these, but nothing changes.
    ' They can NOT be used in "if NaN = NaN Then", or an overflow will result.  Use IsNaN().
    '
    Const bb7 As Byte = &HF8
    Const bb8 As Byte = &HFF
    '
    GetMem1 bb7, ByVal VarPtr(NaN) + 6&
    GetMem1 bb8, ByVal VarPtr(NaN) + 7&
End Function

Public Function IsNaN(d As Double) As Boolean
    ' Infinity also returns TRUE, but we shouldn't be running across infinities.
    '
    Static bb(1 To 8) As Byte
    Const bb7 As Byte = &HF0    ' High 4 bits of byte #7. \
    Const bb8 As Byte = &H7F    ' Low  7 bits of byte #8. /  If all on, it's NaN (or Inf if all other non-sign bits are zero).
    '
    GetMem8 d, bb(1)
    IsNaN = ((bb(7) And bb7) = bb7) And ((bb(8) And bb8) = bb8)
End Function

Public Sub ChangeMissingToNaN(d() As Double, Optional MissingValue As Double = 0&)
    ' This changes the array "in place" to save memory.
    ' Just call:    ChangeMissingToNaN YourArray
    ' Or:          Call ChangeMissingToNaN(YourArray, MissingValue)
    '
    Dim i As Long
    '
    If DblDims(d) <> 1 Then Exit Sub
    For i = LBound(d) To UBound(d)
        If Not IsNaN(d(i)) Then If d(i) = MissingValue Then d(i) = NaN
    Next i
End Sub

Public Function DblDims(dArray() As Double) As Integer
    ' Works on both Static and Dynamic arrays.
    Dim pSA As Long
    '
    GetMem4 ByVal ArrPtr(dArray), pSA
    If pSA <> 0& Then GetMem2 ByVal pSA, DblDims
End Function

' *******************************************
' *******************************************
'
' And now, just some simple statistics.
'
' *******************************************
' *******************************************

Public Function Count(d() As Double) As Long
    ' Returns 0 if not dimensioned.
    ' Skips any NaNs and INFs in the array.
    '
    Dim i As Long
    '
    If DblDims(d) <> 1 Then Exit Function
    For i = LBound(d) To UBound(d)
        If Not IsNaN(d(i)) Then Count = Count + 1&
    Next i
End Function

Public Function Sum(d() As Double) As Double
    ' Returns NaN if d() isn't dimensioned.
    ' Skips any NaNs and INFs in the array.
    ' If there are no valid numbers, NaN is returned.
    '
    Dim i As Long
    Dim cnt As Long
    '
    If DblDims(d) <> 1 Then Sum = NaN: Exit Function
    For i = LBound(d) To UBound(d)
        If Not IsNaN(d(i)) Then
            cnt = cnt + 1
            Sum = Sum + d(i)
        End If
    Next i
    If cnt = 0& Then Sum = NaN
End Function

Public Function Mean(d() As Double) As Double
    ' Returns NaN if d() isn't dimensioned.
    ' Skips any NaNs and INFs in the array.
    ' If there are no valid numbers, NaN is returned.
    '
    Dim cnt As Long
    '
    cnt = Count(d)
    If cnt = 0& Then Mean = NaN: Exit Function
    Mean = Sum(d) / cnt
End Function

Public Function SumSq(d() As Double) As Double
    ' Returns NaN if d() isn't dimensioned.
    ' Skips any NaNs and INFs in the array.
    ' If there are no valid numbers, NaN is returned.
    '
    Dim i As Long
    Dim cnt As Long
    '
    If DblDims(d) <> 1 Then SumSq = NaN: Exit Function
    For i = LBound(d) To UBound(d)
        If Not IsNaN(d(i)) Then
            cnt = cnt + 1
            SumSq = SumSq + d(i) * d(i)
        End If
    Next i
    If cnt <> 0 Then SumSq = NaN
End Function

Public Function SumSqDiff(d() As Double) As Double
    ' This one is the sum-of-squared-differences-from-the-mean.
    ' Returns NaN if d() isn't dimensioned.
    ' Skips any NaNs and INFs in the array.
    ' If there are no valid numbers, NaN is returned.
    '
    Dim i As Long
    Dim m As Double
    '
    m = Mean(d)
    If IsNaN(m) Then SumSqDiff = NaN: Return
    For i = LBound(d) To UBound(d)
        If Not IsNaN(d(i)) Then
            SumSqDiff = SumSqDiff + (d(i) - m) * (d(i) - m)
        End If
    Next i
End Function

Public Function VariancePop(d() As Double) As Double
    VariancePop = MeanSqPop(d)
End Function

Public Function MeanSqPop(d() As Double) As Double
    ' Mean of squared differences based on POPULATION of numbers.
    ' This is also know as the VARIANCE.
    ' This one is for population (all items counted).
    '
    ' Returns NaN if d() isn't dimensioned.
    ' Skips any NaNs and INFs in the array.
    ' If there are no valid numbers, NaN is returned.
    '
    Dim cnt As Long
    '
    cnt = Count(d)
    If cnt = 0& Then MeanSqPop = NaN: Exit Function
    MeanSqPop = SumSqDiff(d) / cnt
End Function

Public Function VarianceSamp(d() As Double) As Double
    VarianceSamp = MeanSqSamp(d)
End Function

Public Function MeanSqSamp(d() As Double) As Double
    ' Mean of squared differences based on SAMPLE of numbers.
    ' This is also know as the VARIANCE.
    ' This one is for sample of items (sampled from some population).
    '
    ' Returns NaN if d() isn't dimensioned.
    ' Skips any NaNs and INFs in the array.
    ' If there are less than 2 valid numbers, NaN is returned.  Valid N must be > 1 to calculate for SAMPLE.
    '
    Dim cnt As Long
    '
    cnt = Count(d)
    If cnt < 2& Then MeanSqSamp = NaN: Exit Function
    MeanSqSamp = SumSqDiff(d) / (cnt - 1&)
End Function

Public Function StDevPop(d() As Double) As Double
    ' Standard deviation based on POPULATION of numbers.
    '
    ' Returns NaN if d() isn't dimensioned.
    ' Skips any NaNs and INFs in the array.
    ' If there are no valid numbers, NaN is returned.
    '
    StDevPop = MeanSqPop(d)
    If IsNaN(StDevPop) Then Exit Function
    StDevPop = Sqr(StDevPop)
End Function

Public Function StDevSamp(d() As Double) As Double
    ' Standard deviation based on SAMPLE of numbers.
    '
    ' Returns NaN if d() isn't dimensioned.
    ' Skips any NaNs and INFs in the array.
    ' If there are less than 2 valid numbers, NaN is returned.  Valid N must be > 1 to calculate for SAMPLE.
    '
    StDevSamp = MeanSqSamp(d)
    If IsNaN(StDevSamp) Then Exit Function
    StDevSamp = Sqr(StDevSamp)
End Function

Public Function StErr(d() As Double) As Double
    ' Standard error of the mean (aka, standard error).
    ' This has no population equivalent.
    '
    ' Returns NaN if d() isn't dimensioned.
    ' Skips any NaNs and INFs in the array.
    ' If there are less than 2 valid numbers, NaN is returned.  Valid N must be > 1 to calculate.
    '
    StErr = MeanSqSamp(d)
    If IsNaN(StErr) Then Exit Function
    StErr = Sqr(StErr / Count(d))
End Function

Public Function OneSampleStudentT(d() As Double, Optional TestVal As Double = 0&) As Double
    ' Returns NaN if d() isn't dimensioned.
    ' Skips any NaNs and INFs in the array.
    ' If there are less than 2 valid numbers, NaN is returned.  Valid N must be > 1 to calculate.
    '
    Dim se As Double
    '
    se = StErr(d)
    If IsNaN(se) Then OneSampleStudentT = NaN: Exit Function
    OneSampleStudentT = (Mean(d) - TestVal) / se
End Function

And here's a continuation of that module, but this part does require distributions:

Code:

' *******************************************
' *******************************************
'
' From here down requires the distributions.
' Most of which were developed from the ALGLIB project.
'
' *******************************************
' *******************************************

Public Function OneSampleTTestPValue(d() As Double, Optional TestVal As Double = 0&, Optional Tails As Long = 2&) As Double
    ' A T-test can be performed either ONE-tailed or TWO-tailed.
    ' This returns the p value, the probability of observing these data if the null hypothesis is true.
    ' If you specify ONE-tailed, you should evaluate the mean, and only consider changes in ONE-DIRECTION from your TestVal as statistically significant.
    '
    ' Returns NaN if d() isn't dimensioned.
    ' Skips any NaNs and INFs in the array.
    ' If there are less than 2 valid numbers, NaN is returned.  Valid N must be > 1 to calculate.
    '
    Dim t As Double
    '
    If Tails < 1& Or Tails > 2& Then Error 6
    t = OneSampleStudentT(d, TestVal)
    If IsNaN(t) Then OneSampleTTestPValue = NaN: Exit Function
    OneSampleTTestPValue = (1# - StudentTCdf(t, Count(d) - 1&)) * Tails
End Function

Public Sub OneSampleConfInt(d() As Double, LoValOut As Double, HiValOut As Double, Optional pCrit As Double = 0.05, Optional Tails As Long = 2&)
    ' As with a T-test, confidence intervals can be constructed either ONE-tailed or TWO-tailed.
    ' However, if you specify ONE-tailed, you should either use LoValOut or HiValOut, but not both.
    ' If TWO-tailed is specified, you would use both LoValOut and HiValOut to construct your confidence interval.
    '
    ' pCrit is the equivalent p-value for your confidence intervals.
    ' For instance, for a 95% CI, we'd specify pCrit = .05.
    '              for a 90% CI, we'd specify pCrit = .10.
    ' pCrit must be in the range of 0 < pCrit < .5 for TWO-tailed; and 0 < pCrit < 1 for ONE-Tailed.
    '
    ' Returns NaN if d() isn't dimensioned.
    ' Skips any NaNs and INFs in the array.
    ' If there are less than 2 valid numbers, NaN is returned.  Valid N must be > 1 to calculate.
    '
    Dim cnt As Long
    Dim tCrit As Double
    Dim se As Double
    Dim m As Double
    '
    ' Validations.
    If Tails < 1& Or Tails > 2& Then Error 6
    If pCrit <= 0# Then Error 6
    If pCrit >= 1# Then Error 6
    If Tails = 2& And pCrit >= 0.5 Then Error 6
    '
    cnt = Count(d)
    If cnt < 2& Then LoValOut = NaN: HiValOut = NaN: Exit Sub
    tCrit = StudentTCdfInv(1# - (pCrit / Tails), cnt - 1&)
    se = StErr(d)
    m = Mean(d)
    '
    LoValOut = m - tCrit * se
    HiValOut = m + tCrit * se
End Sub

And, as stated, complete "run-able" project is attached.

Please feel free to make additional requests, and I'll possibly add them.

Take Care,
Elroy
Attached Files

Viewing all articles
Browse latest Browse all 1513

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>