Pull request review #3770: Trapezoidal distribution
I've added a trapezoidal distribution to numpy.random for consideration, pull request 3770: https://github.com/numpy/numpy/pull/3770

Similar to the triangular distribution, the trapezoidal distribution may be used where the underlying distribution is not known, but some knowledge of the limits and mode exists. The trapezoidal distribution generalizes the triangular distribution by allowing the modal values to be expressed as a range instead of a point estimate.

The trapezoidal distribution implemented, known as the "generalized trapezoidal distribution," has three additional parameters: growth, decay, and boundary ratio. Adjusting these from the default values creates trapezoidal-like distributions with non-linear behavior. Examples can be seen in an R vignette ( http://cran.r-project.org/web/packages/trapezoid/vignettes/trapezoid.pdf ), as well as these papers by J.R. van Dorp and colleagues:

1) van Dorp, J. R. and Kotz, S. (2003) Generalized trapezoidal distributions. Metrika. 58(1):85–97. Preprint available: http://www.seas.gwu.edu/~dorpjr/Publications/JournalPapers/Metrika2003VanDor...
2) van Dorp, J. R., Rambaud, S.C., Perez, J. G., and Pleguezuelo, R. H. (2007) An elicitation procedure for the generalized trapezoidal distribution with a uniform central stage. Decision Analysis Journal. 4:156–166. Preprint available: http://www.seas.gwu.edu/~dorpjr/Publications/JournalPapers/DA2007.pdf

The docstring for the proposed numpy.random.trapezoidal() is as follows:

"""
trapezoidal(left, mode1, mode2, right, size=None, m=2, n=2, alpha=1)

Draw samples from the generalized trapezoidal distribution.

The trapezoidal distribution is defined by minimum (``left``), lower mode (``mode1``), upper mode (``mode2``), and maximum (``right``) parameters. The generalized trapezoidal distribution adds three more parameters: the growth rate (``m``), decay rate (``n``), and boundary ratio (``alpha``) parameters. The generalized trapezoidal distribution simplifies to the trapezoidal distribution when ``m = n = 2`` and ``alpha = 1``. It further simplifies to a triangular distribution when ``mode1 == mode2``.

Parameters
----------
left : scalar
    Lower limit.
mode1 : scalar
    The value where the first peak of the distribution occurs. The value should fulfill the condition ``left <= mode1 <= mode2``.
mode2 : scalar
    The value where the first peak of the distribution occurs. The value should fulfill the condition ``mode1 <= mode2 <= right``.
right : scalar
    Upper limit, should be larger than or equal to `mode2`.
size : int or tuple of ints, optional
    Output shape. Default is None, in which case a single value is returned.
m : scalar, optional
    Growth parameter.
n : scalar, optional
    Decay parameter.
alpha : scalar, optional
    Boundary ratio parameter.

Returns
-------
samples : ndarray or scalar
    The returned samples all lie in the interval [left, right].

Notes
-----
With ``left``, ``mode1``, ``mode2``, ``right``, ``m``, ``n``, and ``alpha`` parametrized as :math:`a, b, c, d, m, n, \\text{ and } \\alpha`, respectively, the probability density function for the generalized trapezoidal distribution is

.. math::
    f_{\\scriptscriptstyle X}(x \\mid \\Theta) = \\mathcal{C}(\\Theta) \\times
    \\begin{cases}
        \\alpha \\left(\\frac{x - a}{b - a} \\right)^{m - 1},
            & \\text{for } a \\leq x < b \\\\
        (1 - \\alpha) \\left(\\frac{x - b}{c - b} \\right) + \\alpha,
            & \\text{for } b \\leq x < c \\\\
        \\left(\\frac{d - x}{d - c} \\right)^{n - 1},
            & \\text{for } c \\leq x \\leq d
    \\end{cases}

with the normalizing constant :math:`\\mathcal{C}(\\Theta)` defined as

.. math::
    \\mathcal{C}(\\Theta) = \\frac{2mn}
    {2 \\alpha \\left(b - a\\right) n
     + \\left(\\alpha + 1 \\right) \\left(c - b \\right) mn
     + 2 \\left(d - c \\right) m}

and where the parameter vector :math:`\\Theta = \\{a, b, c, d, m, n, \\alpha \\}, \\text{ } a \\leq b \\leq c \\leq d, \\text{ and } m, n, \\alpha > 0`.

Similar to the triangular distribution, the trapezoidal distribution may be used where the underlying distribution is not known, but some knowledge of the limits and mode exists. The trapezoidal distribution generalizes the triangular distribution by allowing the modal values to be expressed as a range instead of a point estimate. The growth, decay, and boundary ratio parameters of the generalized trapezoidal distribution further allow for non-linear behavior to be specified.

References
----------
.. [1] van Dorp, J. R. and Kotz, S. (2003) Generalized trapezoidal distributions. Metrika. 58(1):85–97. Preprint available: http://www.seas.gwu.edu/~dorpjr/Publications/JournalPapers/Metrika2003VanDor...
.. [2] van Dorp, J. R., Rambaud, S.C., Perez, J. G., and Pleguezuelo, R. H. (2007) An elicitation procedure for the generalized trapezoidal distribution with a uniform central stage. Decision Analysis Journal. 4:156–166. Preprint available: http://www.seas.gwu.edu/~dorpjr/Publications/JournalPapers/DA2007.pdf

Examples
--------
Draw values from the distribution and plot the histogram:

>>> import matplotlib.pyplot as plt
>>> h = plt.hist(np.random.trapezoidal(0, 0.25, 0.75, 1, 100000), bins=200,
...              normed=True)
>>> plt.show()
"""

I am unsure if NumPy encourages incorporation of new distributions into numpy.random or instead into separate modules, but found the exercise to be helpful regardless.

Thanks,
Jeremy
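The density in the Notes section above can be checked numerically. The following is a minimal pure-Python sketch, not code from the pull request: the names `gen_trapezoid_pdf` and `integrate` are hypothetical, and it assumes strict ordering `a < b < c < d` so that no stage has zero width. It evaluates the three stages (growth, central, decay) with the normalizing constant as written in the docstring.

```python
def gen_trapezoid_pdf(x, a, b, c, d, m=2.0, n=2.0, alpha=1.0):
    """Density of the generalized trapezoidal distribution at a scalar x.

    Hypothetical helper for illustration. Assumes a < b < c < d and
    m, n, alpha > 0. For m < 1 (or n < 1) the density is unbounded at
    x = a (or x = d), so do not evaluate exactly at that endpoint.
    """
    # Normalizing constant C(Theta) from the docstring's Notes section.
    norm = (2.0 * m * n) / (
        2.0 * alpha * (b - a) * n
        + (alpha + 1.0) * (c - b) * m * n
        + 2.0 * (d - c) * m
    )
    if a <= x < b:      # growth stage
        shape = alpha * ((x - a) / (b - a)) ** (m - 1.0)
    elif b <= x < c:    # central stage, linear from alpha up (or down) to 1
        shape = (1.0 - alpha) * ((x - b) / (c - b)) + alpha
    elif c <= x <= d:   # decay stage
        shape = ((d - x) / (d - c)) ** (n - 1.0)
    else:               # outside the support
        return 0.0
    return norm * shape


def integrate(f, lo, hi, steps=20000):
    """Composite midpoint rule; it never evaluates f at lo or hi."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))
```

With the defaults `m = n = 2` and `alpha = 1` on `(0, 0.25, 0.75, 1)`, the plateau height is 4/3, which is what an ordinary trapezoid with those corners requires for unit area. Integrating the density over `[a, d]` with other parameter values should return a value close to 1.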
On Sat, Sep 21, 2013 at 1:55 PM, Jeremy Hetzel <jthetzel@gmail.com> wrote:
I am unsure if NumPy encourages incorporation of new distributions into numpy.random or instead into separate modules, but found the exercise to be helpful regardless.
I don't see a reason that numpy.random shouldn't get new distributions. It would also be useful to add the corresponding distribution to scipy.stats.

I'm not familiar with the generalized trapezoidal distribution and don't know where it's used, nor have I ever used triangular.

naming: n, m would indicate to me that they are integers, but they can be floats (>0). alpha, beta ?

about the parameterization - no problem here

Is there a standard version, e.g. left=0, right=1, mode1=?, ... ?

In scipy.stats.distribution we are required to use a location, scale parameterization, where loc shifts the distribution and scale stretches it. Is there a standard parameterization for that? For example:
left = loc = 0 (default) or left = loc / scale = 0
right = scale = 1 (default)
mode1_relative = mode1 / scale
mode2_relative = mode2 / scale
n, m unchanged
no defaults

just checked: your naming corresponds to triangular, and triang in scipy has the corresponding loc-scale parameterization.

Josef
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Sun, Sep 22, 2013 at 1:24 PM, <josef.pktd@gmail.com> wrote:
On Sat, Sep 21, 2013 at 1:55 PM, Jeremy Hetzel <jthetzel@gmail.com> wrote:
mode2 : scalar The value where the first peak of the distribution occurs. The value should fulfill the condition ``mode1 <= mode2 <= right``.
I think you need to s/first/second in the description of the mode2 parameter?
On Sun, Sep 22, 2013 at 9:47 AM, Mark Szepieniec <mszepien@gmail.com> wrote:
On Sun, Sep 22, 2013 at 1:24 PM, <josef.pktd@gmail.com> wrote:
I don't see a reason that numpy.random shouldn't get new distributions. It would also be useful to add the corresponding distribution to scipy.stats.
I have the pdf, cdf, and inverse cdf for the generalized trapezoidal. I've looked through the other distributions at scipy.stats and adding this one should not be difficult. I'll work on it next.
naming: n, m would indicate to me that they are integers, but they can be floats (>0). alpha, beta ?
The three additional parameters for growth rate, decay rate, and boundary ratio are floats > 0. I renamed them from `m`, `n`, and `alpha` (which is how they're parameterized in the published probability density function) to simply `growth`, `decay`, and `ratio`. Does that fit into the NumPy style? It feels intuitive to me.
Is there a standard version, e.g. left=0, right=1, mode1=?, ... ?
In scipy.stats.distribution we are required to use a location, scale parameterization, where loc shifts the distribution and scale stretches it. Is there a standard parameterization for that? For example:
left = loc = 0 (default) or left = loc / scale = 0
right = scale = 1 (default)
mode1_relative = mode1 / scale
mode2_relative = mode2 / scale
n, m unchanged
no defaults
just checked: your naming corresponds to triangular, and triang in scipy has the corresponding loc-scale parameterization.
Thanks. There is no standard version of the distribution that I'm aware of, but for the purposes of scipy.stats, left=0, right=1 and mode1, mode2 being either 0.25, 0.75 or 1/3, 2/3, seem reasonable. I'll give more thought to the location and scale and send an email to scipy-dev if I need guidance. Looking at scipy.stats.triang, my initial thought is:
left_relative = loc
mode1_relative = loc + mode1*scale
mode2_relative = loc + mode2*scale
right_relative = loc + scale
growth, decay, and ratio are unchanged.
I think you need to s/first/second in the description of the mode2 parameter?
Thanks for catching that. Fixed in a recent commit. mode2 should be the second peak of the distribution.

Jeremy
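The reply above mentions having the pdf, cdf, and inverse cdf in hand. As an illustration of how sampling could work, here is a minimal pure-Python sketch under stated assumptions; it is not necessarily how the pull request draws samples. The name `sample_gen_trapezoid` is hypothetical, and the parameters are named `m`, `n`, `alpha` as in the published density (the renamed `growth`, `decay`, `ratio` would map onto them one to one). The idea is to pick one of the three stages with probability proportional to its area, then invert that stage's own CDF.

```python
import random


def sample_gen_trapezoid(a, b, c, d, m=2.0, n=2.0, alpha=1.0, rng=random):
    """Draw one variate from the generalized trapezoidal distribution.

    Hypothetical helper for illustration. Assumes a <= b <= c <= d with
    a < d, and m, n, alpha > 0. `rng` is anything with a .random() method.
    """
    # Unnormalized area under each stage of the density.
    w_growth = alpha * (b - a) / m
    w_central = (alpha + 1.0) * (c - b) / 2.0
    w_decay = (d - c) / n

    r = rng.random() * (w_growth + w_central + w_decay)  # selects the stage
    u = rng.random()                                     # position within it

    if r < w_growth:
        # Stage CDF is ((x - a) / (b - a)) ** m.
        return a + (b - a) * u ** (1.0 / m)
    if r < w_growth + w_central:
        # Stage density is linear in t = (x - b) / (c - b), rising from
        # alpha to 1; this is the root of (1 - alpha) t^2 + 2 alpha t
        # = (1 + alpha) u, written in a form that also covers alpha == 1.
        root = (alpha * alpha + (1.0 - alpha * alpha) * u) ** 0.5
        t = u * (1.0 + alpha) / (alpha + root)
        return b + (c - b) * t
    # Mirror image of the growth stage, measured back from d.
    return d - (d - c) * u ** (1.0 / n)
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible. A zero-width stage gets zero weight, so the triangular case `b == c` needs no special handling in this sketch.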
On Mon, Sep 23, 2013 at 1:40 PM, Jeremy Hetzel <jthetzel@gmail.com> wrote:
On Sun, Sep 22, 2013 at 9:47 AM, Mark Szepieniec <mszepien@gmail.com> wrote:
On Sun, Sep 22, 2013 at 1:24 PM, <josef.pktd@gmail.com> wrote:
I don't see a reason that numpy.random shouldn't get new distributions. It would also be useful to add the corresponding distribution to scipy.stats.
I have the pdf, cdf, and inverse cdf for the generalized trapezoidal. I've looked through the other distributions at scipy.stats and adding this one should not be difficult. I'll work on it next.
Thank you
naming: n, m would indicate to me that they are integers, but they can be floats (>0). alpha, beta ?
The three additional parameters for growth rate, decay rate, and boundary ratio are floats > 0. I renamed them from `m`, `n`, and `alpha` (which is how they're parameterized in the published probability density function) to simply `growth`, `decay`, and `ratio`. Does that fit into the NumPy style? It feels intuitive to me.
`growth`, `decay`, and `ratio` sound much better. In scipy.stats we are also trying to move away from some of the one-letter argument names.
Is there a standard version, e.g. left=0, right=1, mode1=?, ... ?
In scipy.stats.distribution we are required to use a location, scale parameterization, where loc shifts the distribution and scale stretches it. Is there a standard parameterization for that? For example:
left = loc = 0 (default) or left = loc / scale = 0
right = scale = 1 (default)
mode1_relative = mode1 / scale
mode2_relative = mode2 / scale
n, m unchanged
no defaults
just checked: your naming corresponds to triangular, and triang in scipy has the corresponding loc-scale parameterization.
Thanks. There is no standard version of the distribution that I'm aware of, but for the purposes of scipy.stats, left=0, right=1 and mode1, mode2 being either 0.25, 0.75 or 1/3, 2/3, seem reasonable. I'll give more thought to the location and scale and send an email to scipy-dev if I need guidance. Looking at scipy.stats.triang, my initial thought is:
left_relative = loc
mode1_relative = loc + mode1*scale
mode2_relative = loc + mode2*scale
right_relative = loc + scale
growth, decay, and ratio are unchanged.
mode1 and mode2 don't need defaults; they can be shape parameters, which don't have defaults in scipy.stats. With left=0, right=1 hard-coded in the formulas, we have a "standard" version and get the transformation with loc and scale.

The implied parameterization looks good. On terminology: mode1 and mode2 are "relative" to right - left, based on the 0,1 interval (in fractions of the right - left length), whereas your `xxx_relative` are the actual values on the real line, i.e. not relative to loc and scale.

(It's actually the same as with triang, which I had forgotten to look at initially.)

Josef
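The loc-scale convention discussed in this exchange can be written out as a pair of conversions. This is a minimal sketch with hypothetical helper names (`to_actual`, `to_standard`), not scipy.stats code: the shape parameters `mode1` and `mode2` live on the standard interval [0, 1], and `loc` and `scale` place that interval on the real line.

```python
def to_actual(mode1, mode2, loc=0.0, scale=1.0):
    """Map standardized modes on [0, 1] to (left, mode1, mode2, right).

    Hypothetical helper: left = loc and right = loc + scale, following the
    mapping proposed in the thread.
    """
    if not 0.0 <= mode1 <= mode2 <= 1.0:
        raise ValueError("need 0 <= mode1 <= mode2 <= 1")
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    return (loc, loc + mode1 * scale, loc + mode2 * scale, loc + scale)


def to_standard(left, mode1, mode2, right):
    """Inverse mapping: recover (mode1, mode2, loc, scale) from actual values."""
    scale = right - left
    if scale <= 0.0:
        raise ValueError("right must exceed left")
    return ((mode1 - left) / scale, (mode2 - left) / scale, left, scale)
```

For example, standardized modes 0.25 and 0.75 with `loc=10` and `scale=4` describe a trapezoid on [10, 14] whose plateau runs from 11 to 13. The growth, decay, and ratio parameters are untouched by either conversion.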
participants (3)
- Jeremy Hetzel
- josef.pktd@gmail.com
- Mark Szepieniec