pynssp.detectors package
Submodules
pynssp.detectors.ewma module
- pynssp.detectors.ewma.alert_ewma(df, t='date', y='count', B=28, g=2, w1=0.4, w2=0.9)
Exponentially Weighted Moving Average (EWMA)
The EWMA compares a weighted average of the most recent visit counts to a baseline expectation. For the weighted average being tested, exponential weighting gives the most influence to the most recent observations. This algorithm is appropriate for daily counts that do not have the characteristic features modeled in the regression algorithm. It is more applicable to Emergency Department data from certain hospital groups and to time series with small counts (daily average below 10) because of the limited case definition or chosen geographic region. An alert (red value) is signaled if the statistical test (Student's t-test) applied to the test statistic yields a p-value less than 0.01. If the p-value is greater than or equal to 0.01 and strictly less than 0.05, a warning (yellow value) is signaled. Blue values are returned if an alert or warning does not occur. Grey values represent instances where anomaly detection did not apply (i.e., observations for which baseline data were unavailable).
- Parameters:
df – A pandas data frame containing time series data
t – Name of the column of type Date containing the dates (Default value = “date”)
y – Name of the column of type Numeric containing counts or percentages (Default value = “count”)
B – Baseline parameter. The baseline length is the number of days used to calculate rolling averages, standard deviations, and exponentially weighted moving averages. Defaults to 28 days to match ESSENCE implementation.
g – Guardband parameter. The guardband length is the number of days separating the baseline from the current test date. Defaults to 2 days to match ESSENCE implementation.
w1 – Smoothing coefficient for sensitivity to gradual events. Must be between 0 and 1 and is recommended to be between 0.3 and 0.5 to account for gradual effects. Defaults to 0.4 to match ESSENCE implementation.
w2 – Smoothing coefficient for sensitivity to sudden events. Must be between 0 and 1 and is recommended to be above 0.7 to account for sudden events. Defaults to 0.9 to match ESSENCE implementation and approximate the C2 algorithm.
- Returns:
Original pandas data frame with detection results.
- Examples:
>>> from pynssp import alert_ewma
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame({
...     "date": pd.date_range("2020-01-01", "2020-12-31"),
...     "count": np.random.randint(0, 101, size=366)
... })
>>>
>>> df_ewma = alert_ewma(df)
>>> df_ewma.head()
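The alert levels described above can be sketched as a small mapping from p-value to color. This is a minimal illustration of the stated thresholds, not pynssp's implementation; the function name `alert_color` and the use of `None` for a missing p-value are assumptions.

```python
from typing import Optional


def alert_color(p_value: Optional[float]) -> str:
    """Map a p-value to the alert color described in the text.

    Hypothetical helper for illustration; not part of pynssp.
    """
    if p_value is None:
        # No baseline data available, so detection did not apply.
        return "grey"
    if p_value < 0.01:
        return "red"      # alert
    if p_value < 0.05:
        return "yellow"   # warning: 0.01 <= p < 0.05
    return "blue"         # neither alert nor warning


colors = [alert_color(p) for p in (0.004, 0.01, 0.03, 0.05, None)]
# colors == ["red", "yellow", "yellow", "blue", "grey"]
```

Note that a p-value of exactly 0.01 falls in the warning band and exactly 0.05 falls in the blue band, matching the inequalities in the description.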
- pynssp.detectors.ewma.ewma_loop(df, t, y, B, g, w1, w2)
Loop for EWMA
Loop for EWMA and adjustment of outlying smoothed values
- Parameters:
df – A pandas data frame
t – Name of the column of type Date containing the dates
y – Name of the column containing the response variable data
mu – Numeric vector of baseline averages
B – Baseline parameter. The baseline length is the number of days used to calculate rolling averages, standard deviations, and exponentially weighted moving averages. Defaults to 28 days to match NSSP-ESSENCE implementation.
g – Guardband parameter. The guardband length is the number of days separating the baseline from the current test date. Defaults to 2 days to match NSSP-ESSENCE implementation.
w1 – Smoothing coefficient for sensitivity to gradual events. Must be between 0 and 1 and is recommended to be between 0.3 and 0.5 to account for gradual effects. Defaults to 0.4 to match NSSP-ESSENCE implementation.
w2 – Smoothing coefficient for sensitivity to sudden events. Must be between 0 and 1 and is recommended to be above 0.7 to account for sudden events. Defaults to 0.9 to match NSSP-ESSENCE implementation and approximate the C2 algorithm.
- Returns:
A pandas data frame with p-values and test statistics
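To see how the two smoothing coefficients differ, the following sketch applies the textbook EWMA recursion to a short series. It is only an illustration: the helper name `ewma` and the choice to seed the recursion with the first observation are assumptions, and the sketch omits the adjustment of outlying smoothed values that `ewma_loop` performs.

```python
from typing import List, Sequence


def ewma(values: Sequence[float], w: float) -> List[float]:
    """Textbook exponentially weighted moving average.

    z[0] = values[0]; z[t] = w * values[t] + (1 - w) * z[t - 1].
    Hypothetical helper for illustration; not part of pynssp.
    """
    if not 0 < w <= 1:
        raise ValueError("w must be in (0, 1]")
    smoothed: List[float] = []
    for i, y in enumerate(values):
        if i == 0:
            smoothed.append(float(y))  # seed with the first observation
        else:
            smoothed.append(w * y + (1 - w) * smoothed[-1])
    return smoothed


counts = [10, 10, 10, 20]
gradual = ewma(counts, w=0.4)  # last value is about 14: slow reaction to the jump
sudden = ewma(counts, w=0.9)   # last value is about 19: near-full reaction
```

A larger coefficient puts more weight on the newest observation, which is why the higher value of `w2` is the one recommended for sudden events.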
pynssp.detectors.nbinom module
- pynssp.detectors.nbinom.alert_nbinom(df, baseline_end, t='date', y='count', include_time=True)
Negative binomial detection algorithm for weekly counts
The negative binomial regression algorithm fits a negative binomial regression model with a time term and order 1 Fourier terms to a baseline period that spans 2 or more years. Inclusion of Fourier terms in the model is intended to account for seasonality common in multi-year weekly time series of counts. Order 1 sine and cosine terms are included to account for annual seasonality that is common to syndromes and diseases such as influenza, RSV, and norovirus. Each baseline model is used to make weekly forecasts for all weeks following the baseline period. One-sided upper 95% prediction interval bounds are computed for each week in the prediction period. Alarms are signaled for any week in which the weekly count falls above the upper bound of the prediction interval.
- Parameters:
df – A pandas data frame containing time series data
t – Name of the column of type Date containing the dates (Default value = “date”)
y – Name of the column of type Numeric containing counts or percentages (Default value = “count”)
baseline_end – date of the end of the baseline/training period (in date or string class)
include_time – Indicate whether or not to include time term in regression model (default is True)
- Returns:
Original pandas dataframe with model estimates, upper prediction interval bounds, a binary alarm indicator field, and a binary indicator field of whether or not a time term was included.
- Examples:
>>> from pynssp import alert_nbinom
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame({
...     "date": pd.date_range(start="2014-01-05", end="2022-02-05", freq="W"),
...     "count": np.random.poisson(lam=25, size=(len(pd.date_range(start="2014-01-05", end="2022-02-05", freq="W")),))
... })
>>>
>>> df_nbinom = alert_nbinom(df, baseline_end = "2020-03-01")
>>> df_nbinom.head()
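The order 1 Fourier terms mentioned in the description are a single sine and cosine pair with an annual period. The sketch below builds them for a weekly index using only the standard library. The period of 52.18 weeks (roughly 365.25 / 7) and the helper name `fourier_terms` are assumptions for illustration; the source does not state the exact period used.

```python
import math
from typing import List, Tuple

WEEKS_PER_YEAR = 52.18  # assumed annual period in weeks (about 365.25 / 7)


def fourier_terms(n_weeks: int, period: float = WEEKS_PER_YEAR) -> List[Tuple[float, float]]:
    """Return (sin, cos) order 1 Fourier terms for week indices 1..n_weeks.

    Hypothetical helper for illustration; not part of pynssp.
    """
    return [
        (math.sin(2 * math.pi * t / period), math.cos(2 * math.pi * t / period))
        for t in range(1, n_weeks + 1)
    ]


terms = fourier_terms(104)  # two years of weekly seasonal regressors
```

Each pair would enter the regression as two extra columns alongside the optional time term, letting the fitted curve rise and fall once per year.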
- pynssp.detectors.nbinom.nb_model(df, t, y, baseline_end, include_time)
Negative binomial regression model for weekly counts
Negative binomial model helper function for monitoring weekly count time series with seasonality
- Parameters:
df – A pandas data frame
t – Name of the column of type Date containing the dates
y – Name of the column of type Numeric containing counts
baseline_end – Object of type Date defining the end of the baseline/training period
include_time – Logical indicating whether or not to include time term in regression model
- Returns:
A pandas data frame.
pynssp.detectors.regression module
- pynssp.detectors.regression.adaptive_regression(df, t, y, B, g)
Adaptive Regression
Adaptive Regression helper function for Adaptive Multiple Regression
- Parameters:
df – A pandas data frame containing time series data
t – Name of the column of type Date containing the dates
y – Name of the column containing the response variable data
B – Baseline parameter. The baseline length is the number of days to which each linear model is fit.
g – Guardband parameter. The guardband length is the number of days separating the baseline from the current date in consideration for alerting.
- Returns:
A pandas data frame with p-values and test statistics
- pynssp.detectors.regression.alert_regression(df, t='date', y='count', B=28, g=2)
Adaptive Multiple Regression
The adaptive multiple regression algorithm fits a linear model to a baseline of counts or percentages of length B, and forecasts a predicted value g + 1 days later (guard-band). The difference between the current observed value and this prediction is divided by the standard error of prediction to form the test statistic. The model includes terms to account for linear trends and day-of-week effects. Note that this implementation does NOT account for federal holidays as in the Regression 1.2 algorithm in ESSENCE. An alert (red value) is signaled if the statistical test (Student's t-test) applied to the test statistic yields a p-value less than 0.01. If the p-value is greater than or equal to 0.01 and strictly less than 0.05, a warning (yellow value) is signaled. Blue values are returned if an alert or warning does not occur. Grey values represent instances where anomaly detection did not apply (i.e., observations for which baseline data were unavailable).
- Parameters:
df – A pandas data frame containing time series data
t – The name of the column in df that contains the dates or times of observations. Defaults to “date”.
y – The name of the column in df that contains the values of the time series. Defaults to “count”.
B – The length of the baseline period (in days). Must be a multiple of 7 and at least 7. Defaults to 28.
g – The length of the guard band (in days). Must be non-negative. Defaults to 2.
- Returns:
Original pandas data frame with detection results.
- Examples:
>>> from pynssp import alert_regression
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame({
...     "date": pd.date_range("2020-01-01", "2020-12-31"),
...     "count": np.random.randint(0, 101, size=366)
... })
>>>
>>> df_regression = alert_regression(df)
>>> df_regression.head()
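The relationship between the baseline length B, the guard band g, and the test date can be made concrete with row positions. The sketch below assumes zero-based positions in a gap-free daily series; the helper name `baseline_window` is hypothetical, and the library's internal indexing may differ.

```python
from typing import Optional, Tuple


def baseline_window(i: int, B: int = 28, g: int = 2) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) row positions of the baseline for test row i.

    The baseline ends g + 1 rows before the test row, so g rows separate the two.
    Returns None when fewer than B + g rows precede the test row.
    Hypothetical helper for illustration; not part of pynssp.
    """
    end = i - g - 1
    start = end - B + 1
    if start < 0:
        return None
    return start, end


first_testable = baseline_window(30)  # (0, 27)
too_early = baseline_window(29)       # None
```

Under these assumptions the first B + g rows have no complete baseline, which corresponds to the grey values described above.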
pynssp.detectors.serfling module
- pynssp.detectors.serfling.alert_serfling(df, baseline_end, t='date', y='count')
Original Serfling method for weekly time series
The original Serfling algorithm fits a linear regression model with a time term and order 1 Fourier terms to a baseline period that ideally spans 5 or more years. Inclusion of Fourier terms in the model is intended to account for seasonality common in multi-year weekly time series. Order 1 sine and cosine terms are included to account for annual seasonality that is common to syndromes and diseases such as influenza, RSV, and norovirus. Each baseline model is used to make weekly forecasts for all weeks following the baseline period. One-sided upper 95% prediction interval bounds are computed for each week in the prediction period. Alarms are signaled for any week in which the weekly observation falls above the upper bound of the prediction interval. This implementation follows the approach of the original Serfling method in which weeks between October of the starting year of a season and May of the ending year of a season are considered to be in the epidemic period. Weeks in the epidemic period are removed from the baseline prior to fitting the regression model.
- Parameters:
df – A pandas data frame containing time series data
t – Name of the column of type Date containing the dates (Default value = “date”)
y – Name of the column of type Numeric containing counts or percentages (Default value = “count”)
baseline_end – date of the end of the baseline/training period (in date or string class)
- Returns:
Original pandas dataframe with model estimates, upper prediction interval bounds, and a binary alarm indicator field.
- Examples:
>>> from pynssp import alert_serfling
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame({
...     "date": pd.date_range(start="2014-01-05", end="2022-02-05", freq="W"),
...     "count": np.random.poisson(lam=25, size=(len(pd.date_range(start="2014-01-05", end="2022-02-05", freq="W")),))
... })
>>>
>>> df_serfling = alert_serfling(df, baseline_end = "2020-03-01")
>>> df_serfling.head()
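The removal of epidemic-period weeks from the baseline can be sketched with the standard library alone. The sketch assumes that October and May are both inclusive and that the baseline includes weeks on or before `baseline_end`; the description does not spell out either boundary, and the helper names are hypothetical.

```python
from datetime import date, timedelta
from typing import List


def in_epidemic_period(d: date) -> bool:
    """True for dates from October through May (assumed inclusive)."""
    return d.month >= 10 or d.month <= 5


def non_epidemic_baseline(weeks: List[date], baseline_end: date) -> List[date]:
    """Keep weeks on or before baseline_end that fall outside the epidemic period.

    Hypothetical helper for illustration; not part of pynssp.
    """
    return [d for d in weeks if d <= baseline_end and not in_epidemic_period(d)]


start = date(2019, 1, 6)  # a Sunday
weeks = [start + timedelta(weeks=k) for k in range(104)]
baseline = non_epidemic_baseline(weeks, baseline_end=date(2020, 3, 1))
```

Here only the June through September weeks of 2019 survive, which shows why a long baseline is recommended: each year contributes only about four months of training data.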
- pynssp.detectors.serfling.serfling_model(df, t, y, baseline_end)
Original Serfling method for weekly time series
Serfling model helper function for monitoring weekly time series with seasonality
- Parameters:
df – A pandas data frame
t – Name of the column of type Date containing the dates
y – Name of the column of type Numeric containing counts
baseline_end – date of the end of the baseline/training period (date or string class)
- Returns:
A pandas data frame.
pynssp.detectors.switch module
- pynssp.detectors.switch.alert_switch(df, t='date', y='count', B=28, g=2, w1=0.4, w2=0.9)
Regression/EWMA Switch
The NSSP-ESSENCE Regression/EWMA Switch algorithm generalizes the Regression and EWMA algorithms by applying the most appropriate algorithm for the data in the baseline. First, adaptive multiple regression is applied, and the adjusted R squared value of the model is examined to see if it meets a threshold of 0.60. If this threshold is not met, the model is considered not to explain the data well. In this case, the algorithm switches to the EWMA algorithm, which is more appropriate for the sparser time series that are common with county-level trends. The smoothing coefficient for the EWMA algorithm is fixed to 0.4.
- Parameters:
df – A dataframe containing the time series data.
t – The name of the column in df containing the time information. Defaults to “date”.
y – The name of the column in df containing the values to be analyzed. Defaults to “count”.
B – The length of the baseline period in days, must be a multiple of 7 and greater than or equal to 7. Defaults to 28.
g – The length of the guardband period in days. Must be non-negative. Defaults to 2.
w1 – Smoothing coefficient for sensitivity to gradual events. Must be between 0 and 1 and is recommended to be between 0.3 and 0.5 to account for gradual effects. Defaults to 0.4 to match NSSP-ESSENCE implementation.
w2 – Smoothing coefficient for sensitivity to sudden events. Must be between 0 and 1 and is recommended to be above 0.7 to account for sudden events. Defaults to 0.9 to match NSSP-ESSENCE implementation and approximate the C2 algorithm.
- Returns:
A dataframe containing the results of the analysis.
- Examples:
>>> from pynssp import alert_switch
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame({
...     "date": pd.date_range("2020-01-01", "2020-12-31"),
...     "count": np.random.randint(0, 101, size=366)
... })
>>>
>>> df_switch = alert_switch(df)
>>> df_switch.head()
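The switching rule in the description reduces to a comparison of the adjusted R squared against 0.60. The sketch below uses the standard adjusted R squared formula; the helper names, the reading of "meets" as greater than or equal to, and the count of 7 predictors (one trend term plus six day-of-week terms) are assumptions made for illustration.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Standard adjusted R squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than model terms")
    return 1 - (1 - r_squared) * (n_obs - 1) / dof


def choose_detector(adj_r2: float, threshold: float = 0.60) -> str:
    """Pick regression when the fit meets the threshold, otherwise EWMA.

    Hypothetical helper for illustration; not part of pynssp.
    """
    return "regression" if adj_r2 >= threshold else "ewma"


# 28-day baseline with 7 predictors (assumed: 1 trend + 6 day-of-week terms)
good_fit = choose_detector(adjusted_r_squared(0.80, n_obs=28, n_predictors=7))  # "regression"
poor_fit = choose_detector(adjusted_r_squared(0.50, n_obs=28, n_predictors=7))  # "ewma"
```

With these numbers an R squared of 0.80 adjusts to about 0.73 and keeps the regression, while 0.50 adjusts to about 0.33 and triggers the switch to EWMA.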
pynssp.detectors.trend module
- pynssp.detectors.trend.classify_trend(df, t='date', data_count='dataCount', all_count='allCount', B=12)
Trend Classification for Proportions/Percentages
The algorithm fits rolling binomial models to a daily time series of percentages or proportions in order to classify the overall trend during the baseline period as significantly increasing, significantly decreasing, or stable.
- Parameters:
df – A pandas data frame
t – Name of the column of type Date containing the dates (Default value = “date”)
data_count – Name of the column with counts for positive encounters (Default value = “dataCount”)
all_count – Name of the column with total counts of encounters (Default value = “allCount”)
B – Baseline parameter. The baseline length is the number of days to which each binomial model is fit (Default value = 12)
- Returns:
A pandas data frame
- Examples:
>>> from pynssp import classify_trend
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame({
...     "date": pd.date_range("2020-01-01", "2020-12-31"),
...     "dataCount": np.random.randint(0, 101, size=366),
...     "allCount": np.random.randint(101, 500, size=366)
... })
>>>
>>> df_trend = classify_trend(df)
>>> df_trend.head()
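The final classification step can be sketched as a rule on the sign and significance of a fitted slope. This sketch does not fit the binomial models; it only shows how a slope and p-value from such a model might map to the three categories. The significance level `alpha=0.01`, the label strings, and the helper names are assumptions, since the description does not state them.

```python
from typing import List


def proportions(data_counts: List[int], all_counts: List[int]) -> List[float]:
    """Daily proportion of positive encounters; NaN where the total is zero."""
    return [d / a if a else float("nan") for d, a in zip(data_counts, all_counts)]


def trend_label(slope: float, p_value: float, alpha: float = 0.01) -> str:
    """Classify a baseline trend from a fitted slope and its p-value.

    Hypothetical helper; alpha and the label strings are assumed values.
    """
    if p_value >= alpha:
        return "stable"
    return "increasing" if slope > 0 else "decreasing"


daily = proportions([5, 10], [50, 100])  # [0.1, 0.1]
labels = [trend_label(0.2, 0.001), trend_label(-0.2, 0.001), trend_label(0.2, 0.30)]
# labels == ["increasing", "decreasing", "stable"]
```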
- pynssp.detectors.trend.get_trends(df, t, data_count, all_count, B)
Trend Classification Helper
Fits rolling binomial models to a daily time series of percentages or proportions in order to classify the overall trend during the baseline period as significantly increasing, significantly decreasing, or stable.
- Parameters:
df – A pandas data frame
t – Name of the column of type Date containing the dates
data_count – Name of the column with counts for positive encounters
all_count – Name of the column with total counts of encounters
B – Baseline parameter. The baseline length is the number of days to which each binomial model is fit
- Returns:
A pandas data frame