
Module picalo.Benfords

The Benfords module performs digital analyses on data sets. In the 1930s, Benford observed that many sets of numbers (now known to include invoice amounts and stock prices) follow a predictable pattern: a 1 appears as the first digit about 30 percent of the time, and larger leading digits are progressively less common. Each digit position in a number has its own probability distribution over the digits that can appear there.

Numbers that are made up by people (who usually are not thinking about Benford's Law) tend not to follow Benford's distribution. In recent years, Benford's Law has been used to help separate values that occur naturally in business from those that are fabricated.
Classes
Result A simple class to represent a result statistic for an individual data point

Function Summary
Table analyze(table, col, number_of_significant_digits)
Performs a Benford's analysis on a table column.
float calc_benford(position, digit, base)
Helper function that codes Benford's actual formula. The generalized formula was found at http://www.mathpages.com/home/kmath302/kmath302.htm. This method calculates the probability that a given digit is in a given position.
float get_expected(number, number_of_significant_digits)
Calculates Benford's expected probability for a given number to a certain number of digits.

Variable Summary
tuple __functions__ = ('analyze', 'calc_benford', 'get_expecte...

Function Details

analyze(table, col, number_of_significant_digits=1)

Performs a Benford's analysis on a table column. Returns a picalo table. The frequency and expected values will be 0 for items that were not analyzed (due to insufficient digits).

Important: if you ask for 2 significant digits, any input numbers that do not have two digits are ignored and not included in the results. If these numbers were not ignored, the analysis would throw errors.

Example:
>>> table = Table([('col000', unicode), ('col001', int), ('col002', int)], [
...             ['Dan',10,8],
...             ['Sally',12,12],
...             ['Dan',11,15], 
...             ['Sally',12,14], 
...             ['Dan',11,16], 
...             ['Sally',15,15], 
...             ['Dan',16,15], 
...             ['Sally',13,14]])
>>> results = Benfords.analyze(table, 1, number_of_significant_digits=2)            
>>> results.view()
+--------+--------------------+------------------+----------------------+-----------------+
| Number | Significant Digits | Actual Frequency | Expected Probability |    Difference   |
+--------+--------------------+------------------+----------------------+-----------------+
|     10 | 10                 |            0.125 |      0.0360270497068 | 0.0889729502932 |
|     12 | 12                 |             0.25 |      0.0327585353738 |  0.217241464626 |
|     11 | 11                 |             0.25 |      0.0342843373349 |  0.215715662665 |
|     12 | 12                 |             0.25 |      0.0327585353738 |  0.217241464626 |
|     11 | 11                 |             0.25 |      0.0342843373349 |  0.215715662665 |
|     15 | 15                 |            0.125 |      0.0291027478744 | 0.0958972521256 |
|     16 | 16                 |            0.125 |      0.0281085963079 | 0.0968914036921 |
|     13 | 13                 |            0.125 |       0.031406327064 |  0.093593672936 |
+--------+--------------------+------------------+----------------------+-----------------+
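The Actual Frequency column above can be reproduced with a short plain-Python sketch. It does not use Picalo: `leading_digits` and `actual_frequencies` are hypothetical helper names, and dividing by the count of analyzed values (rather than all values) is an assumption, since both counts are 8 in this example.

```python
from collections import Counter

def leading_digits(number, n):
    """Return the first n digits of an integer as a string, or None if it has fewer."""
    digits = str(abs(int(number)))
    if digits == "0" or len(digits) < n:
        return None  # too few significant digits: skipped, as the note above describes
    return digits[:n]

def actual_frequencies(values, n=1):
    """Map each leading-digit string to its share of the values that were analyzed."""
    keys = [leading_digits(v, n) for v in values]
    analyzed = [k for k in keys if k is not None]
    counts = Counter(analyzed)
    return {k: c / len(analyzed) for k, c in counts.items()}

# Values of col001 from the example table
col001 = [10, 12, 11, 12, 11, 15, 16, 13]
freqs = actual_frequencies(col001, n=2)
print(freqs["10"], freqs["12"])  # 0.125 0.25
```

With two significant digits requested, a single-digit value such as 5 would be dropped before the frequencies are computed.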
Example 2: I usually add a column for the difference from Benford's expectation to the table, then summarize to get an average difference per vendor, employee, etc. Individual numbers will often not match Benford, but averages across several numbers should match.
>>> table = Table([('col000', unicode), ('col001', int), ('col002', int)], [
...             ['Dan',10,8],
...             ['Sally',12,12],
...             ['Dan',11,15], 
...             ['Sally',12,14], 
...             ['Dan',11,16], 
...             ['Sally',15,15], 
...             ['Dan',16,15], 
...             ['Sally',13,14]])
>>> table.append_column('ben_diff', Benfords.analyze(table, 1, 2).column(4))
>>> table.view()
+--------+--------+--------+-----------------+
| col000 | col001 | col002 |     ben_diff    |
+--------+--------+--------+-----------------+
| Dan    |     10 |      8 | 0.0889729502932 |
| Sally  |     12 |     12 |  0.217241464626 |
| Dan    |     11 |     15 |  0.215715662665 |
| Sally  |     12 |     14 |  0.217241464626 |
| Dan    |     11 |     16 |  0.215715662665 |
| Sally  |     15 |     15 | 0.0958972521256 |
| Dan    |     16 |     15 | 0.0968914036921 |
| Sally  |     13 |     14 |  0.093593672936 |
+--------+--------+--------+-----------------+

>>> results = Grouping.summarize_by_value(table, 'col000', 
...           ben_avg="sum(group['ben_diff']) / len(group)")
>>> results.view()
+----------+--------+----------------+
| StartKey | EndKey |    ben_avg     |
+----------+--------+----------------+
| Dan      | Dan    | 0.154323919829 |
| Sally    | Sally  | 0.155993463579 |
+----------+--------+----------------+
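For readers without Picalo at hand, the per-group average above can be checked with a plain-Python sketch. The `average_by_key` helper is hypothetical and only mirrors the `sum(group['ben_diff']) / len(group)` expression; it is not how `Grouping.summarize_by_value` is implemented.

```python
from collections import defaultdict

# (col000, ben_diff) pairs copied from the table above
rows = [
    ("Dan", 0.0889729502932),
    ("Sally", 0.217241464626),
    ("Dan", 0.215715662665),
    ("Sally", 0.217241464626),
    ("Dan", 0.215715662665),
    ("Sally", 0.0958972521256),
    ("Dan", 0.0968914036921),
    ("Sally", 0.093593672936),
]

def average_by_key(pairs):
    """Group values by key and return the mean of each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

averages = average_by_key(rows)
for name, avg in sorted(averages.items()):
    print(name, round(avg, 6))
```

The two averages agree with the ben_avg column (about 0.1543 for Dan and 0.1560 for Sally).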
Parameters:
table - A Picalo table
           (type=Table)
col - The column to be analyzed. (The examples above pass the column index 1.)
           (type=str)
number_of_significant_digits - The number of leading digits to use in the analysis. Higher numbers (3-5) require more data for statistical power.
           (type=int)
Returns:
A Picalo table describing the results of the analysis.
           (type=Table)

calc_benford(position, digit, base=10)

Helper function that codes Benford's actual formula. The generalized formula was found at http://www.mathpages.com/home/kmath302/kmath302.htm. This method calculates the probability that a given digit is in a given position.
Parameters:
position - The position in the number (0=first digit, 1=second digit, ...)
           (type=int)
digit - The digit value (0,1,2,3,4,5,6,7,8,9) whose probability is calculated
           (type=int)
base - The number base. Optional (default is 10)
           (type=int)
Returns:
The Benford probability (the fraction of the time this position will have this digit).
           (type=float)
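As an illustration of the generalized formula, the following is a minimal sketch and not necessarily how calc_benford is written. The name `benford_digit_probability` is hypothetical; it assumes the standard result that the first digit d has probability log_b(1 + 1/d), and that a later position sums log_b(1 + 1/(k*b + d)) over every possible prefix k of preceding digits.

```python
import math

def benford_digit_probability(position, digit, base=10):
    """Probability that `digit` appears at `position` (0 = first digit)."""
    if position == 0:
        if digit == 0:
            return 0.0  # a leading digit is never zero
        return math.log(1 + 1 / digit, base)
    # All prefixes with exactly `position` digits: base**(position-1) .. base**position - 1
    start = base ** (position - 1)
    stop = base ** position
    return sum(math.log(1 + 1 / (k * base + digit), base) for k in range(start, stop))

print(round(benford_digit_probability(0, 1), 5))  # 0.30103
print(round(benford_digit_probability(1, 0), 5))  # 0.11968
```

The probabilities for each position sum to 1 across all digits, which is a quick sanity check on any implementation.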

get_expected(number, number_of_significant_digits=1)

Calculates Benford's expected probability for a given number to a certain number of digits. For example, given the number 1234, calculate the combined probability that a 1 appears as the first digit, a 2 appears as the second digit, and so forth, up to the desired number of significant digits.
Parameters:
number - The number (1234 in the example) to calculate the probability for
           (type=float)
number_of_significant_digits - The number of positions to use for the probability. In the example, a value of 2 means to calculate the probability for the number 12.
           (type=int)
Returns:
The expected frequency of this number according to Benford's Law.
           (type=float)
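The combined probability described above can be sketched as a product of per-position probabilities. This is an assumption about the calculation, inferred from the Expected Probability values in the analyze example; the names `digit_probability` and `expected_probability` are hypothetical, and the sketch handles base-10 integers only.

```python
import math

def digit_probability(position, digit):
    """Base-10 Benford probability of `digit` at `position` (0 = first digit)."""
    if position == 0:
        return math.log10(1 + 1 / digit)
    prefixes = range(10 ** (position - 1), 10 ** position)
    return sum(math.log10(1 + 1 / (10 * k + digit)) for k in prefixes)

def expected_probability(number, significant_digits=1):
    """Multiply the per-position probabilities of the leading digits of `number`."""
    digits = [int(c) for c in str(abs(int(number)))][:significant_digits]
    if len(digits) < significant_digits or digits[0] == 0:
        raise ValueError("number has too few significant digits")
    return math.prod(digit_probability(i, d) for i, d in enumerate(digits))

print(round(expected_probability(10, 2), 4))    # 0.036
print(round(expected_probability(1234, 2), 4))  # 0.0328
```

The rounded values line up with the Expected Probability entries for 10 and 12 in the analyze example. Note that this product treats the positions as independent, so it differs from the joint probability of the leading digit pair, which for "10" is log10(1 + 1/10), roughly 0.0414.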

Variable Details

__functions__

Type:
tuple
Value:
('analyze', 'calc_benford', 'get_expected')                            

Generated by Epydoc 2.1 on Mon Aug 20 05:38:16 2007 http://epydoc.sf.net