analyze(table, col, number_of_significant_digits=1)
Performs a Benford's Law analysis on a table column and returns a
Picalo table. The frequency and expected values will be 0 for items
that were not analyzed because they have too few digits.
Important: if you ask for 2 significant digits, any input numbers
with fewer than two digits are ignored and not included in the
results. If these numbers were not ignored, the analysis would raise
errors.
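The digit rule above can be sketched in plain Python. This is only an illustration, not Picalo's implementation: the helper name leading_digits is hypothetical, and it assumes integer inputs like those in the examples.

```python
def leading_digits(value, number_of_significant_digits=1):
    """Return the first n digits of an integer as an int, or None when
    the number has fewer than n digits.

    Hypothetical helper for illustration; not part of Picalo.
    """
    # Drop the sign and any leading zeros (so 0 yields no digits at all).
    digits = str(abs(int(value))).lstrip('0')
    if len(digits) < number_of_significant_digits:
        return None  # too short: such a number would be skipped
    return int(digits[:number_of_significant_digits])


# col002 from the example table contains the single-digit value 8.
col002 = [8, 12, 15, 14, 16, 15, 15, 14]
two_digit = [leading_digits(v, 2) for v in col002]
print(two_digit)
# [None, 12, 15, 14, 16, 15, 15, 14]
```

With 2 significant digits the value 8 has nothing to contribute, which is why such numbers are left out.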
Example:
>>> table = Table([('col000', unicode), ('col001', int), ('col002', int)], [
... ['Dan',10,8],
... ['Sally',12,12],
... ['Dan',11,15],
... ['Sally',12,14],
... ['Dan',11,16],
... ['Sally',15,15],
... ['Dan',16,15],
... ['Sally',13,14]])
>>> results = Benfords.analyze(table, 1, number_of_significant_digits=2)
>>> results.view()
+--------+--------------------+------------------+----------------------+-----------------+
| Number | Significant Digits | Actual Frequency | Expected Probability | Difference      |
+--------+--------------------+------------------+----------------------+-----------------+
| 10     | 10                 | 0.125            | 0.0360270497068      | 0.0889729502932 |
| 12     | 12                 | 0.25             | 0.0327585353738      | 0.217241464626  |
| 11     | 11                 | 0.25             | 0.0342843373349      | 0.215715662665  |
| 12     | 12                 | 0.25             | 0.0327585353738      | 0.217241464626  |
| 11     | 11                 | 0.25             | 0.0342843373349      | 0.215715662665  |
| 15     | 15                 | 0.125            | 0.0291027478744      | 0.0958972521256 |
| 16     | 16                 | 0.125            | 0.0281085963079      | 0.0968914036921 |
| 13     | 13                 | 0.125            | 0.031406327064       | 0.093593672936  |
+--------+--------------------+------------------+----------------------+-----------------+
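The Actual Frequency column above can be reproduced without Picalo. The sketch below is an assumption about how that column is derived (the share of rows with the same leading digits), checked only against the values shown. It does not compute the Expected Probability column. In the rows shown, Difference equals Actual Frequency minus Expected Probability.

```python
from collections import Counter

# col001 from the example table; every value has exactly two digits,
# so its first two significant digits are the number itself.
col001 = [10, 12, 11, 12, 11, 15, 16, 13]

counts = Counter(col001)
total = float(len(col001))

# Share of rows that have the same leading digits as each row.
actual_frequency = [counts[v] / total for v in col001]
print(actual_frequency)
# [0.125, 0.25, 0.25, 0.25, 0.25, 0.125, 0.125, 0.125]

# First row of the table: actual frequency minus the expected
# probability printed by Picalo gives the Difference column.
first_difference = actual_frequency[0] - 0.0360270497068
```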
Example 2: I usually add the difference from Benford's expectation to
the table as a new column, then summarize to get an average difference
per vendor, employee, etc. Individual numbers will often not match
Benford's Law, but averages across several numbers should.
>>> table = Table([('col000', unicode), ('col001', int), ('col002', int)], [
... ['Dan',10,8],
... ['Sally',12,12],
... ['Dan',11,15],
... ['Sally',12,14],
... ['Dan',11,16],
... ['Sally',15,15],
... ['Dan',16,15],
... ['Sally',13,14]])
>>> table.append_column('ben_diff', Benfords.analyze(table, 1, 2).column(4))
>>> table.view()
+--------+--------+--------+-----------------+
| col000 | col001 | col002 | ben_diff        |
+--------+--------+--------+-----------------+
| Dan    | 10     | 8      | 0.0889729502932 |
| Sally  | 12     | 12     | 0.217241464626  |
| Dan    | 11     | 15     | 0.215715662665  |
| Sally  | 12     | 14     | 0.217241464626  |
| Dan    | 11     | 16     | 0.215715662665  |
| Sally  | 15     | 15     | 0.0958972521256 |
| Dan    | 16     | 15     | 0.0968914036921 |
| Sally  | 13     | 14     | 0.093593672936  |
+--------+--------+--------+-----------------+
>>> results = Grouping.summarize_by_value(table, 'col000',
... ben_avg="sum(group['ben_diff']) / len(group)")
>>> results.view()
+----------+--------+----------------+
| StartKey | EndKey | ben_avg        |
+----------+--------+----------------+
| Dan      | Dan    | 0.154323919829 |
| Sally    | Sally  | 0.155993463579 |
+----------+--------+----------------+
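For readers without Picalo at hand, the per-person average above can be checked in plain Python. This is a minimal sketch of the same arithmetic as the ben_avg expression, using the ben_diff values printed in the table. The names rows and groups are illustrative only.

```python
from collections import defaultdict

# (col000, ben_diff) pairs copied from the table above.
rows = [
    ('Dan', 0.0889729502932),
    ('Sally', 0.217241464626),
    ('Dan', 0.215715662665),
    ('Sally', 0.217241464626),
    ('Dan', 0.215715662665),
    ('Sally', 0.0958972521256),
    ('Dan', 0.0968914036921),
    ('Sally', 0.093593672936),
]

# Collect the differences for each person.
groups = defaultdict(list)
for name, diff in rows:
    groups[name].append(diff)

# Same arithmetic as sum(group['ben_diff']) / len(group).
ben_avg = {name: sum(diffs) / len(diffs) for name, diffs in groups.items()}
for name in sorted(ben_avg):
    print(name, round(ben_avg[name], 12))
```

Both averages agree with the summarized table to the precision it prints.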
Parameters:
table -
A Picalo table
(type=Table)
col -
The column to be analyzed (the examples above pass the column
index 1 rather than a name).
(type=str)
number_of_significant_digits -
The number of leading digits to use in the analysis. Higher
values (3-5) require more data to give statistically meaningful
results.
(type=int)
Returns:
A Picalo table describing the results of the analysis.
(type=Table)
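One way to see why a higher number_of_significant_digits needs more data: the first digit can be 1-9 and each later digit 0-9, so the number of possible leading-digit values grows tenfold with every extra digit. The helper below is a hypothetical illustration of that arithmetic, not part of Picalo.

```python
def benford_bin_count(number_of_significant_digits):
    """Number of distinct leading-digit values for n significant digits.

    Hypothetical helper for illustration; not part of Picalo.
    """
    # Nine choices for the first digit, ten for each digit after it.
    return 9 * 10 ** (number_of_significant_digits - 1)


for n in range(1, 6):
    print(n, benford_bin_count(n))
# 1 9
# 2 90
# 3 900
# 4 9000
# 5 90000
```

A small table spread over tens of thousands of possible values leaves most of them with no observations at all.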