| Home | Trees | Index | Help |
|
|---|
| Package picalo :: Module Grouping |
|
The Grouping module contains functions that stratify and summarize records in different ways. Grouping is a basic fraud detection method that helps generate norms and compare records against those norms. Further, it splits data sets into individual groups that should be analyzed separately.
The Grouping module only stratifiess a table into many tables. It doesn't do any summarization of values. All detail data are still in the tables.
The summary routines not only stratifies data on key, but also summarizes the detail tables using sum, mean, and other statistical routines. You get only one table returned from Summarize. Stratifying gives the intermediate result -- the full detail in many tables.
For those familiar with SQL, the summarize routines are similar to the GROUP BY keyword. Groups are collapsed into summary records and a single table is returned. The stratifying functions are similar to running one query to retrieve all unique key values, followed by a query to retrieve the records that match each key value. The result is many tables (one per unique key value).| Function Summary | |
|---|---|
| TableArray |
Stratifies a Picalo table into a specified number of sub-tables. |
| TableArray |
Stratifies the table rows into specific groups by date ranges. |
| TableArray |
Stratifies a table based upon the return value from an expression. |
| TableArray |
Stratifies a table based upon the value of col. |
| TableArray |
Stratifies a Picalo table by composite key (combination of values in the cols columns). |
| Table |
Summarizes a sequence of groups by evaluating a series of expressions on each group. |
| Table |
Summarizes the table rows into specific groups by date ranges. |
| Table |
Summarizess a table based upon the return value from expressions. |
| Table |
Summarizes a table based upon the value of col. |
| Table |
Stratifies a Picalo table by composite key (combination of values in the col_list columns). |
| Variable Summary | |
|---|---|
list |
__functions__ = ['stratify', 'stratify_by_value', 'strat...
|
| Function Details |
|---|
stratify(table, number_of_groups)Stratifies a Picalo table into a specified number of sub-tables. The table should be sorted appropriately before this function is called. This function does not modify the underlying table. The start and end record indices are recorded in table.startvalue and table.endvalue. See the example for more information. Example:>>> table = Table([('col000', int), ('col001', int)], [[1,1],[2,2],[3,3]]) >>> groups = Grouping.stratify(table, 2) >>> print groups[0].startvalue, groups[0].endvalue 0, 1 >>> groups[0].view() +--------+--------+ | col000 | col001 | +--------+--------+ | 1 | 1 | | 2 | 2 | +--------+--------+ >>> print groups[1].startvalue, groups[1].endvalue 3, 3 >>> groups[1].view() +--------+--------+ | col000 | col001 | +--------+--------+ | 3 | 3 | +--------+--------+
|
stratify_by_date(table, col, duration)Stratifies the table rows into specific groups by date ranges. This function is useful to split a table into ranges such as two-week periods (useful for analyzing timecard and invoice databases). Aging can also be done by sorting the table in reverse (newest to oldest dates) and using a negative duration. The start and end values of each group are recorded in the table object since they may be different than the actual start and end column values. See the example for more information. See the base.Calendar module for more information about time durations. Example:>>> table = Table([('col000', DateTime)], [ ... [DateTime(2000,1,1)], ... [DateTime(2000,1,13)], ... [DateTime(2000,1,14)], ... [DateTime(2000,1,15)] ... ]) >>> groups = Grouping.stratify_by_date(table, 0, DateDelta(7)) >>> print groups[0].startvalue, groups[0].endvalue 2000-01-01 00:00:00, 2000-01-08 00:00:00 >>> groups[0].view() +------------------------+ | col000 | +------------------------+ | 2000-01-01 00:00:00.00 | +------------------------+ >>> print groups[1].startvalue, groups[1].endvalue 2000-01-08 00:00:00, 2000-01-15 00:00:00 >>> groups[1].view() +------------------------+ | col000 | +------------------------+ | 2000-01-13 00:00:00.00 | | 2000-01-14 00:00:00.00 | +------------------------+ >>> print groups[2].startvalue, groups[2].endvalue 2000-01-15 00:00:00, 2000-01-22 00:00:00 >>> groups[2].view() +------------------------+ | col000 | +------------------------+ | 2000-01-15 00:00:00.00 | +------------------------+
|
stratify_by_expression(table, expression)Stratifies a table based upon the return value from an expression. For each record in the table, the expression is evaluated with the following variables:
If the expression evaluates to True, rec is placed in a new group and becomes startrec If the expression evaluates to False, rec is placed in the current group. The start and end record indices are recorded in table.startvalue and table.endvalue. See the example for more information. Example (starts a new group on each odd value in column 1):>>> table1 = Table([('col000', int)], [[1],[2],[3],[4]]) >>> groups = Grouping.stratify_by_expression(table1, "record[0] % 2.0 == 1.0") >>> print groups[0].startvalue, groups[0].endvalue 0, 1 >>> groups[0].view() +--------+ | col000 | +--------+ | 1 | | 2 | +--------+ >>> print groups[1].startvalue, groups[1].endvalue 3, 4 >>> groups[1].view() +--------+ | col000 | +--------+ | 3 | | 4 | +--------+
|
stratify_by_step(table, col, step)Stratifies a table based upon the value of col. Each time the value of col jumps > step, a new group is started. The table should be sorted correctly *before* this method is called. This method simply runs through the table records sequentially. The start and end values of each group are recorded in the table object since they may be different than the actual start and end column values. See the example for more information. Records are stratified where startvalue >= record[col] < endvalue. Example:>>> table1 = Table([('col000', int), ('col001', int)], [[1,1], [2,2], [5.9,3], [6,1], [8,2], [16,1]]) >>> groups = Grouping.stratify_by_step(table1, 0, 5) >>> print groups[0].startvalue, groups[0].endvalue 1, 6 >>> groups[0].view() +--------+--------+ | col000 | col001 | +--------+--------+ | 1 | 1 | | 2 | 2 | | 5.9 | 3 | +--------+--------+ >>> print groups[1].startvalue, groups[1].endvalue 6, 11 >>> groups[1].view() +--------+--------+ | col000 | col001 | +--------+--------+ | 6 | 1 | | 8 | 2 | +--------+--------+ >>> print groups[2].startvalue, groups[2].endvalue 11, 16 >>> groups[2].view() +--------+--------+ | col000 | col001 | +--------+--------+ +--------+--------+ >>> print groups[3].startvalue, groups[3].endvalue 16, 21 >>> groups[3].view() +--------+--------+ | col000 | col001 | +--------+--------+ | 16 | 1 | +--------+--------+
|
stratify_by_value(table, *cols)Stratifies a Picalo table by composite key (combination of values in the cols columns). A new table is created for each unique composite key, resulting in a list of tables. This function does not modify the underlying table. Each key is recorded as the start and end values of each group. See the example for more information. Example:>>> table = Table([('col000', int), ('col001', int)], [['Dan',10],['Sally',10],['Dan',11]]) >>> groups = Grouping.stratify_by_value(table, 'col000') >>> groups[0].view() +--------+--------+ | col000 | col001 | +--------+--------+ | Dan | 10 | | Dan | 11 | +--------+--------+ >>> groups[1].view() +--------+--------+ | col000 | col001 | +--------+--------+ | Sally | 10 | +--------+--------+
|
summarize(groups, **expressions)Summarizes a sequence of groups by evaluating a series of expressions on each group. The result is a single table with one row representing each group in groups. This is analogous to the SQL GROUP BY command. Each item in expressions should be an expression that summarizes a group. It must evaluate to a single value summarizing the entire table. Most expressions will probably only summarize a single column as is done in the example. Use the Table['colname'] method do get the desired column. Example:>>> from picalo.lib import stats >>> table = Table([('col000', int), ('col001', int), ('col002', int')], [[1,1,1],[1,1,2],[2,1,2],[2,1,3]]) >>> groups = Grouping.stratify_by_value(table, 0, 1) >>> summary = Grouping.summarize(groups, sum="sum(group['col002'])", avg="stats.mean(group['col002'])") >>> summary.view() +------------+----------+-----+-----+ | StartValue | EndValue | sum | avg | +------------+----------+-----+-----+ | (1, 1) | (1, 1) | 3 | 1.5 | | (2, 1) | (2, 1) | 5 | 2.5 | +------------+----------+-----+-----+
|
summarize_by_date(table, col, num_days_in_groups, **expressions)Summarizes the table rows into specific groups by date ranges. The function then summarizes the list of groups by running a series of functions on each group. The result is single table with one row representing each group. This is analogous to the SQL GROUP BY command. This function does not modify the underlying table. Aging can also be done by sorting the table in reverse (newest to oldest dates) and using a negative num_days_in_groups. Each item in expressions should be an expression that summarizes a group. It must evaluate to a single value summarizing the entire table. Most expressions will probably only summarize a single column as is done in the example. Use the Table['colname'] method do get the desired column. Example:>>> table = Table([('col000', DateTime), ('col001', int), ('col002', int)], [ ... [DateTime(2000,1,1)], ... [DateTime(2000,1,13)], ... [DateTime(2000,1,14)], ... [DateTime(2000,1,15)] ... ]) >>> summary = Grouping.summarize_by_date(table, 0, TimeDelta(7), first="group[0][0]", last="group[-1][0]") >>> summary.view() +---------------------+---------------------+---------------------+---------------------+ | StartValue | EndValue | last | first | +---------------------+---------------------+---------------------+---------------------+ | 2000-01-01 00:00:00 | 2000-01-08 00:00:00 | 2000-01-01 00:00:00 | 2000-01-01 00:00:00 | | 2000-01-08 00:00:00 | 2000-01-15 00:00:00 | 2000-01-14 00:00:00 | 2000-01-13 00:00:00 | | 2000-01-15 00:00:00 | 2000-01-22 00:00:00 | 2000-01-15 00:00:00 | 2000-01-15 00:00:00 | +---------------------+---------------------+---------------------+---------------------+
|
summarize_by_expression(table, stratifying_expression, **expressions)Summarizess a table based upon the return value from expressions. The function then summarizes the list of groups by running a series of functions on each group. The result is single table with one row representing each group. This is analogous to the SQL GROUP BY command. This function does not modify the underlying table. Each item in expressions should be an expression that summarizes a group. It must evaluate to a single value summarizing the entire table. Most expressions will probably only summarize a single column as is done in the example. Use the Table['colname'] method do get the desired column.
|
summarize_by_step(table, col, step, **expressions)Summarizes a table based upon the value of col. Each time the value of col jumps > step, a new group is started. The function then summarizes the list of groups by running a series of functions on each group. The result is single table with one row representing each group. This is analogous to the SQL GROUP BY command. This function does not modify the underlying table. Each item in expressions should be an expression that summarizes a group. It must evaluate to a single value summarizing the entire table. Most expressions will probably only summarize a single column as is done in the example. Use the Table['colname'] method do get the desired column. Example:>>> from picalo.lib import stats >>> table1 = Table([('col000', int), ('col001', int)], [[1,1], [2,2], [5.9,3], [6,1], [8,2], [16,1]]) >>> summary = Grouping.summarize_by_step(table1, 0, 5, count="len(group)") >>> summary.view() +------------+----------+-------+ | StartValue | EndValue | count | +------------+----------+-------+ | 1 | 6 | 3 | | 6 | 11 | 2 | | 11 | 16 | 0 | | 16 | 21 | 1 | +------------+----------+-------+
|
summarize_by_value(table, *cols, **expressions)Stratifies a Picalo table by composite key (combination of values in the col_list columns). A new table is created for each unique composite key, resulting in a list of tables. The function then summarizes the list of groups by running a series of expressions on each group. The result is single table with one row representing each group. This is analogous to the SQL GROUP BY command. This function does not modify the underlying table. Each item in expressions should be an expression that summarizes a group. It must evaluate to a single value summarizing the entire table. Most expressions will probably only summarize a single column as is done in the example. Use the Table['colname'] method do get the desired column. Example:>>> from picalo import * >>> from picalo.lib import stats >>> table = Table([('col000', unicode), ('col001', int), ('col002', int)], [ ... ['Dan',10,8], ... ['Sally',12,12], ... ['Dan',11,15], ... ['Sally',12,14], ... ['Dan',11,16], ... ['Sally',15,15], ... ['Dan',16,15], ... ['Sally',13,14]]) >>> results = Grouping.summarize_by_value(table, 'col000', ... sum="sum(group['col001'])", ... correlation="stats.spearmanr(list(group['col001']), list(['col002']))") >>> results.view() +-------+-----+--------------------------------------------+ | Value | sum | correlation | +-------+-----+--------------------------------------------+ | Dan | 48 | (0.55000000000000004, 0.45000000000651308) | | Sally | 52 | (0.84999999999999998, 0.1499999999962324) | +-------+-----+--------------------------------------------+ >>> # the first number in the correlation is the statistic, second number is the p-value
|
| Variable Details |
|---|
__functions__
|
| Home | Trees | Index | Help |
|
|---|
| Generated by Epydoc 2.1 on Mon Aug 20 05:38:16 2007 | http://epydoc.sf.net |