Package picalo :: Module Grouping
[show private | hide private]
[frames | no frames]

Module picalo.Grouping

The Grouping module contains functions that stratify and summarize records in different ways. Grouping is a basic fraud detection method that helps generate norms and compare records against those norms. Further, it splits data sets into individual groups that should be analyzed separately.

The Grouping module only stratifiess a table into many tables. It doesn't do any summarization of values. All detail data are still in the tables.

The summary routines not only stratifies data on key, but also summarizes the detail tables using sum, mean, and other statistical routines. You get only one table returned from Summarize. Stratifying gives the intermediate result -- the full detail in many tables.

For those familiar with SQL, the summarize routines are similar to the GROUP BY keyword. Groups are collapsed into summary records and a single table is returned. The stratifying functions are similar to running one query to retrieve all unique key values, followed by a query to retrieve the records that match each key value. The result is many tables (one per unique key value).
Function Summary
TableArray stratify(table, number_of_groups)
Stratifies a Picalo table into a specified number of sub-tables.
TableArray stratify_by_date(table, col, duration)
Stratifies the table rows into specific groups by date ranges.
TableArray stratify_by_expression(table, expression)
Stratifies a table based upon the return value from an expression.
TableArray stratify_by_step(table, col, step)
Stratifies a table based upon the value of col.
TableArray stratify_by_value(table, *cols)
Stratifies a Picalo table by composite key (combination of values in the cols columns).
Table summarize(groups, **expressions)
Summarizes a sequence of groups by evaluating a series of expressions on each group.
Table summarize_by_date(table, col, num_days_in_groups, **expressions)
Summarizes the table rows into specific groups by date ranges.
Table summarize_by_expression(table, stratifying_expression, **expressions)
Summarizess a table based upon the return value from expressions.
Table summarize_by_step(table, col, step, **expressions)
Summarizes a table based upon the value of col.
Table summarize_by_value(table, *cols, **expressions)
Stratifies a Picalo table by composite key (combination of values in the col_list columns).

Variable Summary
list __functions__ = ['stratify', 'stratify_by_value', 'strat...

Function Details

stratify(table, number_of_groups)

Stratifies a Picalo table into a specified number of sub-tables. The table should be sorted appropriately before this function is called. This function does not modify the underlying table.

The start and end record indices are recorded in table.startvalue and table.endvalue. See the example for more information.

Example:
>>> table = Table([('col000', int), ('col001', int)], [[1,1],[2,2],[3,3]])
>>> groups = Grouping.stratify(table, 2)
>>> print groups[0].startvalue, groups[0].endvalue
0, 1

>>> groups[0].view()
+--------+--------+
| col000 | col001 |
+--------+--------+
|      1 |      1 |
|      2 |      2 |
+--------+--------+

>>> print groups[1].startvalue, groups[1].endvalue
3, 3

>>> groups[1].view()
+--------+--------+
| col000 | col001 |
+--------+--------+
|      3 |      3 |
+--------+--------+
Parameters:
table - The table to be stratified
           (type=Table)
number_of_groups - The number of sub-tables to create
           (type=int)
Returns:
A list of new tables
           (type=TableArray)

stratify_by_date(table, col, duration)

Stratifies the table rows into specific groups by date ranges. This function is useful to split a table into ranges such as two-week periods (useful for analyzing timecard and invoice databases).

Aging can also be done by sorting the table in reverse (newest to oldest dates) and using a negative duration.

The start and end values of each group are recorded in the table object since they may be different than the actual start and end column values. See the example for more information.

See the base.Calendar module for more information about time durations.

Example:
>>> table = Table([('col000', DateTime)], [
...   [DateTime(2000,1,1)],
...   [DateTime(2000,1,13)],
...   [DateTime(2000,1,14)],
...   [DateTime(2000,1,15)]
... ])
>>> groups = Grouping.stratify_by_date(table, 0, DateDelta(7))
>>> print groups[0].startvalue, groups[0].endvalue
2000-01-01 00:00:00, 2000-01-08 00:00:00

>>> groups[0].view()
+------------------------+
|         col000         |
+------------------------+
| 2000-01-01 00:00:00.00 |
+------------------------+

>>> print groups[1].startvalue, groups[1].endvalue
2000-01-08 00:00:00, 2000-01-15 00:00:00

>>> groups[1].view()
+------------------------+
|         col000         |
+------------------------+
| 2000-01-13 00:00:00.00 |
| 2000-01-14 00:00:00.00 |
+------------------------+

>>> print groups[2].startvalue, groups[2].endvalue
2000-01-15 00:00:00, 2000-01-22 00:00:00

>>> groups[2].view()
+------------------------+
|         col000         |
+------------------------+
| 2000-01-15 00:00:00.00 |
+------------------------+
Parameters:
table - The table to be stratified
           (type=Table)
col - The column to stratify by.
           (type=str)
duration - The number of days or seconds to put into each group.
           (type=int)
Returns:
A list of new tables, each containing rows from table that were stratified together
           (type=TableArray)

stratify_by_expression(table, expression)

Stratifies a table based upon the return value from an expression. For each record in the table, the expression is evaluated with the following variables:
  1. startrecord => the starting record of the current group.
  2. record => the current record being evaluated.

If the expression evaluates to True, rec is placed in a new group and becomes startrec If the expression evaluates to False, rec is placed in the current group.

The start and end record indices are recorded in table.startvalue and table.endvalue. See the example for more information.

Example (starts a new group on each odd value in column 1):
>>> table1 = Table([('col000', int)], [[1],[2],[3],[4]])
>>> groups = Grouping.stratify_by_expression(table1, "record[0] % 2.0 == 1.0")
>>> print groups[0].startvalue, groups[0].endvalue
0, 1

>>> groups[0].view()
+--------+
| col000 |
+--------+
|      1 |
|      2 |
+--------+

>>> print groups[1].startvalue, groups[1].endvalue
3, 4

>>> groups[1].view()
+--------+
| col000 |
+--------+
|      3 |
|      4 |
+--------+
Parameters:
table - The table to be stratified
           (type=Table)
expression - An expression that evaluates the current record and returns whether a new group should be started
           (type=str)
Returns:
A list of new tables, each containing rows from table that were grouped together
           (type=TableArray)

stratify_by_step(table, col, step)

Stratifies a table based upon the value of col. Each time the value of col jumps > step, a new group is started.

The table should be sorted correctly *before* this method is called. This method simply runs through the table records sequentially.

The start and end values of each group are recorded in the table object since they may be different than the actual start and end column values. See the example for more information.

Records are stratified where startvalue >= record[col] < endvalue.

Example:
>>> table1 = Table([('col000', int), ('col001', int)], [[1,1], [2,2], [5.9,3], [6,1], [8,2], [16,1]])
>>> groups = Grouping.stratify_by_step(table1, 0, 5)
>>> print groups[0].startvalue, groups[0].endvalue
1, 6

>>> groups[0].view()
+--------+--------+
| col000 | col001 |
+--------+--------+
|      1 |      1 |
|      2 |      2 |
|    5.9 |      3 |
+--------+--------+

>>> print groups[1].startvalue, groups[1].endvalue
6, 11

>>> groups[1].view()
+--------+--------+
| col000 | col001 |
+--------+--------+
|      6 |      1 |
|      8 |      2 |
+--------+--------+

>>> print groups[2].startvalue, groups[2].endvalue
11, 16

>>> groups[2].view()
+--------+--------+
| col000 | col001 |
+--------+--------+
+--------+--------+

>>> print groups[3].startvalue, groups[3].endvalue
16, 21

>>> groups[3].view()
+--------+--------+
| col000 | col001 |
+--------+--------+
|     16 |      1 |
+--------+--------+
Parameters:
table - The table to be stratified
           (type=Table)
col - The column to use to step by
           (type=str)
step - The step value that starts a new group
           (type=int)
Returns:
A list of new tables, each containing rows from table that were stratified together
           (type=TableArray)

stratify_by_value(table, *cols)

Stratifies a Picalo table by composite key (combination of values in the cols columns). A new table is created for each unique composite key, resulting in a list of tables. This function does not modify the underlying table.

Each key is recorded as the start and end values of each group. See the example for more information.

Example:
>>> table = Table([('col000', int), ('col001', int)], [['Dan',10],['Sally',10],['Dan',11]])
>>> groups = Grouping.stratify_by_value(table, 'col000')
>>> groups[0].view()
+--------+--------+
| col000 | col001 |
+--------+--------+
| Dan    |     10 |
| Dan    |     11 |
+--------+--------+

>>> groups[1].view()
+--------+--------+
| col000 | col001 |
+--------+--------+
| Sally  |     10 |
+--------+--------+
Parameters:
table - The table to be stratified
           (type=Table)
cols - The remaining parameters are the column names/indices to stratify by
           (type=str)
Returns:
A list of new tables, each table containing rows with the same key values.
           (type=TableArray)

summarize(groups, **expressions)

Summarizes a sequence of groups by evaluating a series of expressions on each group. The result is a single table with one row representing each group in groups. This is analogous to the SQL GROUP BY command.

Each item in expressions should be an expression that summarizes a group. It must evaluate to a single value summarizing the entire table. Most expressions will probably only summarize a single column as is done in the example. Use the Table['colname'] method do get the desired column.

Example:
>>> from picalo.lib import stats
>>> table = Table([('col000', int), ('col001', int), ('col002', int')], [[1,1,1],[1,1,2],[2,1,2],[2,1,3]])
>>> groups = Grouping.stratify_by_value(table, 0, 1)
>>> summary = Grouping.summarize(groups, sum="sum(group['col002'])", avg="stats.mean(group['col002'])")
>>> summary.view()
+------------+----------+-----+-----+
| StartValue | EndValue | sum | avg |
+------------+----------+-----+-----+
|     (1, 1) |   (1, 1) |   3 | 1.5 |
|     (2, 1) |   (2, 1) |   5 | 2.5 |
+------------+----------+-----+-----+
Parameters:
groups - A TableArray of groups, probably created by one of the stratify_by_... routines.
           (type=TableArray)
expressions - One or more colname=expression pairs to summarize by.
Returns:
A single table containing the summaries of the groups.
           (type=Table)

summarize_by_date(table, col, num_days_in_groups, **expressions)

Summarizes the table rows into specific groups by date ranges. The function then summarizes the list of groups by running a series of functions on each group. The result is single table with one row representing each group. This is analogous to the SQL GROUP BY command. This function does not modify the underlying table.

Aging can also be done by sorting the table in reverse (newest to oldest dates) and using a negative num_days_in_groups.

Each item in expressions should be an expression that summarizes a group. It must evaluate to a single value summarizing the entire table. Most expressions will probably only summarize a single column as is done in the example. Use the Table['colname'] method do get the desired column.

Example:
>>> table = Table([('col000', DateTime), ('col001', int), ('col002', int)], [
...   [DateTime(2000,1,1)],
...   [DateTime(2000,1,13)],
...   [DateTime(2000,1,14)],
...   [DateTime(2000,1,15)]
... ])
>>> summary = Grouping.summarize_by_date(table, 0, TimeDelta(7), first="group[0][0]", last="group[-1][0]")
>>> summary.view()
+---------------------+---------------------+---------------------+---------------------+
|      StartValue     |       EndValue      |         last        |        first        |
+---------------------+---------------------+---------------------+---------------------+
| 2000-01-01 00:00:00 | 2000-01-08 00:00:00 | 2000-01-01 00:00:00 | 2000-01-01 00:00:00 |
| 2000-01-08 00:00:00 | 2000-01-15 00:00:00 | 2000-01-14 00:00:00 | 2000-01-13 00:00:00 |
| 2000-01-15 00:00:00 | 2000-01-22 00:00:00 | 2000-01-15 00:00:00 | 2000-01-15 00:00:00 |
+---------------------+---------------------+---------------------+---------------------+
Parameters:
table - The table to be summarized
           (type=Table)
col - The column to stratify by.
           (type=str)
num_days_in_groups - The number of days or seconds to put into each group.
           (type=int)
expressions - One or more colname=expression pairs to summarize by.
Returns:
A single table containing the summaries of the groups.
           (type=Table)

summarize_by_expression(table, stratifying_expression, **expressions)

Summarizess a table based upon the return value from expressions. The function then summarizes the list of groups by running a series of functions on each group. The result is single table with one row representing each group. This is analogous to the SQL GROUP BY command. This function does not modify the underlying table.

Each item in expressions should be an expression that summarizes a group. It must evaluate to a single value summarizing the entire table. Most expressions will probably only summarize a single column as is done in the example. Use the Table['colname'] method do get the desired column.
Parameters:
table - The table to be summarized
           (type=Table)
stratifying_expression - An expression that evaluates the current record and returns whether a new group should be started
           (type=str)
expressions - One or more colname=expression pairs to summarize by.
Returns:
A single table containing the summaries of the groups.
           (type=Table)

summarize_by_step(table, col, step, **expressions)

Summarizes a table based upon the value of col. Each time the value of col jumps > step, a new group is started. The function then summarizes the list of groups by running a series of functions on each group. The result is single table with one row representing each group. This is analogous to the SQL GROUP BY command. This function does not modify the underlying table.

Each item in expressions should be an expression that summarizes a group. It must evaluate to a single value summarizing the entire table. Most expressions will probably only summarize a single column as is done in the example. Use the Table['colname'] method do get the desired column.

Example:
>>> from picalo.lib import stats
>>> table1 = Table([('col000', int), ('col001', int)], [[1,1], [2,2], [5.9,3], [6,1], [8,2], [16,1]])
>>> summary = Grouping.summarize_by_step(table1, 0, 5, count="len(group)")
>>> summary.view()
+------------+----------+-------+
| StartValue | EndValue | count |
+------------+----------+-------+
|          1 |        6 |     3 |
|          6 |       11 |     2 |
|         11 |       16 |     0 |
|         16 |       21 |     1 |
+------------+----------+-------+
Parameters:
table - The table to be summarized
           (type=Table)
col - The column to stratify by.
           (type=str)
step - The step value that starts a new group
           (type=float)
expressions - One or more colname=expression pairs to summarize by.
Returns:
A single table containing the summaries of the groups.
           (type=Table)

summarize_by_value(table, *cols, **expressions)

Stratifies a Picalo table by composite key (combination of values in the col_list columns). A new table is created for each unique composite key, resulting in a list of tables. The function then summarizes the list of groups by running a series of expressions on each group. The result is single table with one row representing each group. This is analogous to the SQL GROUP BY command. This function does not modify the underlying table.

Each item in expressions should be an expression that summarizes a group. It must evaluate to a single value summarizing the entire table. Most expressions will probably only summarize a single column as is done in the example. Use the Table['colname'] method do get the desired column.

Example:
>>> from picalo import *
>>> from picalo.lib import stats
>>> table = Table([('col000', unicode), ('col001', int), ('col002', int)], [
...      ['Dan',10,8],
...      ['Sally',12,12],
...      ['Dan',11,15], 
...      ['Sally',12,14], 
...      ['Dan',11,16], 
...      ['Sally',15,15], 
...      ['Dan',16,15], 
...      ['Sally',13,14]])
>>> results = Grouping.summarize_by_value(table, 'col000', 
...      sum="sum(group['col001'])", 
...      correlation="stats.spearmanr(list(group['col001']), list(['col002']))")
>>> results.view()
+-------+-----+--------------------------------------------+
| Value | sum |                correlation                 |
+-------+-----+--------------------------------------------+
| Dan   |  48 | (0.55000000000000004, 0.45000000000651308) |
| Sally |  52 |  (0.84999999999999998, 0.1499999999962324) |
+-------+-----+--------------------------------------------+

>>> # the first number in the correlation is the statistic, second number is the p-value
Parameters:
table - The table to be summarized
           (type=Table)
cols - One or more columns to summarize with the given expressions.
expressions - One or more colname=expression pairs to summarize by.
Returns:
A single table containing the summaries of the groups.
           (type=Table)

Variable Details

__functions__

Type:
list
Value:
['stratify',
 'stratify_by_value',
 'stratify_by_expression',
 'stratify_by_step',
 'stratify_by_date',
 'summarize',
 'summarize_by_expression',
 'summarize_by_value',
...                                                                    

Generated by Epydoc 2.1 on Mon Aug 20 05:38:16 2007 http://epydoc.sf.net