问题描述
我已经在C中实现了以单精度(32位浮点)计算的三角函数(sin,cos,arctan)的一些近似值。它们的精确度约为+/- 2 ulp。
我的目标设备不支持任何<cmath>
或<math.h>
方法。它不提供FMA,但提供MAC ALU。 ALU和LU以32位格式进行计算。
我的arctan近似实际上是approximation of N.juffa的修改版本,它在整个范围上近似arctan。正弦和余弦函数在[-pi,pi]范围内精确到2 ulp。
我现在的目标是为正弦和余弦提供更大的输入范围(尽可能大,最好为[FLT_MIN,FLT_MAX]),这使我可以减少参数。
我目前正在阅读A RGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit by K.C.Ng之类的不同论文或与此new argument reduction algorithm相关的论文,但我无法从中获得实现。
我还要提及两个涉及相关问题的stackoverflow问题:有一个approach with matlab and c++,它基于我链接的第一篇论文。它实际上是使用matlab,cmath方法并将输入限制为[0,20.000]。评论中已经提到了另一个。这是一种使用C语言中的sin和cos实现的方法,它使用了我不可用的各种c库。由于这两个职位都已经有几年历史了,因此可能会有一些新发现。
在这种情况下,最常用的算法似乎是精确存储2 / pi的数量直至所需的位数,以便能够准确地计算模运算,同时避免抵消。我的设备没有提供大型DMEM,这意味着无法使用具有数百位的大型查找表。 this参考资料的第70页上实际描述了该过程,顺便提一下,该过程提供了许多有关浮点数学的有用信息。
所以我的问题是:是否有另一种有效的方法来减少正弦和余弦的参数以获得单精度以避免大型LUT?上面提到的论文实际上都是针对双精度的,最多使用1000位数字,这不适合我的用例。
我实际上没有在C中找到任何实现,也没有针对单精度计算的实现,对于任何种类的提示/ links / examples ...,我将不胜感激。
解决方法
以下代码基于previous answer,在该代码中,我演示了如何通过对幅度较小的参数使用Cody-Waite拆分常数方法以及Payne-来对三角函数执行相当准确的参数归约。 Hanek方法用于处理数量较大的论点。有关Payne-Hanek算法的详细信息,请参见此处。有关Cody-Waite算法的详细信息,请参见我的previous answer。
在这里,我进行了必要的调整,以适应asker平台的限制,因为不支持64位类型,不支持融合乘法加法,并且math.h
中的辅助函数不可用。我假设float
映射到IEEE-754 binary32
格式,并且有一种方法可以将这种32位浮点数重新解释为32位无符号整数,反之亦然。我已经通过标准的可移植习惯用法(即使用memcpy()
)来实现了这种重新解释,但是可以选择适合未指定目标平台的其他方法,例如内联汇编,特定于机器的内部函数或易失性联合
由于此代码基本上是我以前的代码在更严格的环境中的移植,因此可能缺少专门针对该环境的 de novo 设计的优雅之处。我基本上已经将frexp()
中的math.h
辅助函数替换为一些纠缠,用成对的32位整数模拟了64位整数计算,并用32位定点替换了双精度计算。计算(效果比我预期的要好得多),并用未融合的等效项替换了所有FMA。
重新处理参数减少的Cody-Waite部分需要大量的工作。显然,在没有FMA的情况下,我们需要确保常数π/ 2的组成部分中有足够的尾随零位(最低有效位除外),以确保乘积正确。我花了几个小时实验性地弄清了一个特定的拆分,该拆分可以提供准确的结果,但也将切换点尽可能地推向Payne-Hanek方法。
指定USE_FMA = 1
时,使用高质量数学库编译的测试应用的输出应类似于以下内容:
Testing sinf ... PASSED. max ulp err = 1.493253 diffsum = 337633490
Testing cosf ... PASSED. max ulp err = 1.495098 diffsum = 342020968
使用USE_FMA = 0
时,精度会稍有变化:
Testing sinf ... PASSED. max ulp err = 1.498012 diffsum = 359699036
Testing cosf ... PASSED. max ulp err = 1.504061 diffsum = 364679288
diffsum
输出是整体准确性的粗略指标,此处显示所有输入中约有90%产生正确的舍入单精度响应。
请注意,使用最严格的浮点设置和对编译器提供的IEEE-754的最高遵守程度来编译代码很重要。对于我用来开发和测试此代码的Intel编译器,可以通过使用/fp:strict
进行编译来实现。同样,用于参考的数学库的质量对于准确评估此单精度代码的ulp误差至关重要。英特尔编译器带有一个数学库,该库提供双精度基本数学函数,在HA(高精度)变体中误差略大于0.5 ulp。使用多精度参考库可能是更可取的选择,但在这里会使我慢很多。
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h> // for memcpy()
#include <math.h> // for test purposes,and when PORTABLE=1 or USE_FMA=1
#define USE_FMA (0) // use fmaf() calls for arithmetic
#define PORTABLE (0) // allow helper functions from math.h
#define CW_STAGES (3) // number of stages in Cody-Waite reduction when USE_FMA=0
#if USE_FMA
#define SIN_RED_SWITCHOVER (117435.992f)
#define COS_RED_SWITCHOVER (71476.0625f)
#define MAX_DIFF (1)
#else // USE_FMA
#if CW_STAGES == 2
#define SIN_RED_SWITCHOVER (3.921875f)
#define COS_RED_SWITCHOVER (3.921875f)
#elif CW_STAGES == 3
#define SIN_RED_SWITCHOVER (201.15625f)
#define COS_RED_SWITCHOVER (142.90625f)
#endif // CW_STAGES
#define MAX_DIFF (2)
#endif // USE_FMA
/* re-interpret the bit pattern of an IEEE-754 float as a uint32 */
uint32_t float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r,&a,sizeof r);
return r;
}
/* re-interpret the bit pattern of a uint32 as an IEEE-754 float */
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r,sizeof r);
return r;
}
/* compute the upper 32 bits of the product of two unsigned 32-bit integers */
uint32_t umul32_hi (uint32_t a,uint32_t b)
{
/* split operands into halves */
uint32_t al = (uint16_t)a;
uint32_t ah = a >> 16;
uint32_t bl = (uint16_t)b;
uint32_t bh = b >> 16;
/* compute partial products */
uint32_t p0 = al * bl;
uint32_t p1 = al * bh;
uint32_t p2 = ah * bl;
uint32_t p3 = ah * bh;
/* sum partial products */
uint32_t cy = ((p0 >> 16) + (uint16_t)p1 + (uint16_t)p2) >> 16;
return p3 + (p2 >> 16) + (p1 >> 16) + cy;
}
/* 190 bits of 2/PI for Payne-Hanek style argument reduction. */
const uint32_t two_over_pi_f [] =
{
0x28be60db,0x9391054a,0x7f09d5f4,0x7d4d3770,0x36d8a566,0x4f10e410
};
/* reduce a trig function argument using the swlo Paye-Hanek method */
float trig_red_slowpath_f (float a,int *quadrant)
{
uint32_t ia,hi,mid,lo,tmp,i,l,h,plo,phi;
int e,q;
float r;
#if PORTABLE
ia = (uint32_t)(fabsf (frexpf (a,&e)) * 0x1.0p32f); // 4.29496730e+9
#else // PORTABLE
ia = uint32_as_float ((float_as_uint32 (a) & 0x007fffff) + 0x4f000000);
e = ((float_as_uint32 (a) >> 23) & 0xff) - 126;
#endif // PORTABLE
/* compute product x * 2/pi in 2.62 fixed-point format */
i = (uint32_t)e >> 5;
e = (uint32_t)e & 31;
hi = i ? two_over_pi_f [i-1] : 0;
mid = two_over_pi_f [i+0];
lo = two_over_pi_f [i+1];
tmp = two_over_pi_f [i+2];
if (e) {
hi = (hi << e) | (mid >> (32 - e));
mid = (mid << e) | (lo >> (32 - e));
lo = (lo << e) | (tmp >> (32 - e));
}
/* compute 64-bit product phi:plo */
phi = 0;
l = ia * lo;
h = umul32_hi (ia,lo);
plo = phi + l;
phi = h + (plo < l);
l = ia * mid;
h = umul32_hi (ia,mid);
plo = phi + l;
phi = h + (plo < l);
l = ia * hi;
phi = phi + l;
/* split fixed-point result into integer and fraction portions */
q = phi >> 30; // integral portion = quadrant<1:0>
phi = phi & 0x3fffffff; // fraction
if (phi & 0x20000000) { // fraction >= 0.5
phi = phi - 0x40000000; // fraction - 1.0
q = q + 1;
}
/* compute remainder of x / (pi/2) */
#if USE_FMA
float phif,plof,chif,clof,thif,tlof;
phif = 0x1.0p27f * (float)(int32_t)(phi & 0xffffffe0);
plof = (float)((plo >> 5) | (phi << (32-5)));
thif = phif + plof;
plof = (phif - thif) + plof;
phif = thif;
chif = 0x1.921fb6p-57f; // (1.5707963267948966 * 0x1.0p-57)_hi
clof = -0x1.777a5cp-82f; // (1.5707963267948966 * 0x1.0p-57)_lo
thif = phif * chif;
tlof = fmaf (phif,-thif);
tlof = fmaf (phif,tlof);
tlof = fmaf (plof,tlof);
r = thif + tlof;
#else // USE_FMA
/* record sign of fraction */
uint32_t s = phi & 0x80000000;
/* take absolute value of fraction */
if ((int32_t)phi < 0) {
phi = ~phi;
plo = 0 - plo;
phi += (plo == 0);
}
/* normalize fraction */
e = 0;
while ((int32_t)phi > 0) {
phi = (phi << 1) | (plo >> 31);
plo = (plo << 1);
e--;
}
/* multiply 32 high-order bits of fraction with pi/2 */
plo = phi * 0xc90fdaa2; // (uint32_t)rint(PI/2 * 2**31)
phi = umul32_hi (phi,0xc90fdaa2);
/* normalize product */
if ((int)phi > 0) {
phi = (phi << 1) | (plo >> 31);
e--;
}
/* round and convert to floating point */
uint32_t ri = s | (((e + 128) << 23) + ((phi >> 8) + ((phi & 0xff) > 0x7e)));
r = uint32_as_float (ri);
#endif // USE_FMA
if (a < 0.0f) {
r = -r;
q = -q;
}
*quadrant = q;
return r;
}
/* Argument reduction for trigonometric functions that reduces the argument
to the interval [-PI/4,+PI/4] and also returns the quadrant. It returns
-0.0f for an input of -0.0f
*/
float trig_red_f (float a,float switch_over,int *q)
{
float j,r;
if (fabsf (a) > switch_over) {
/* Payne-Hanek style reduction. M. Payne and R. Hanek,"Radian reduction
for trigonometric functions". SIGNUM Newsletter,18:19-24,1983
*/
r = trig_red_slowpath_f (a,q);
} else {
/* Cody-Waite style reduction. W. J. Cody and W. Waite,"Software Manual
for the Elementary Functions",Prentice-Hall 1980
*/
#if USE_FMA
j = fmaf (a,0x1.45f306p-1f,0x1.8p+23f) - 0x1.8p+23f; // 6.36619747e-1,1.25829120e+7
r = fmaf (j,-0x1.921fb0p+00f,a); // -1.57079601e+00 // pio2_high
r = fmaf (j,-0x1.5110b4p-22f,r); // -3.13916473e-07 // pio2_mid
r = fmaf (j,-0x1.846988p-48f,r); // -5.39030253e-15 // pio2_low
#else // USE_FMA
j = (a * 0x1.45f306p-1f + 0x1.8p+23f) - 0x1.8p+23f; // 6.36619747e-1,1.25829120e+7
#if CW_STAGES == 2
r = a - j * 0x1.921fb4p+0f; // pio2_high
r = r - j * 0x1.4442d2p-24f; // pio2_low
#elif CW_STAGES == 3
r = a - j * 0x1.921f00p+00f; // 1.57078552e+00 // pio2_high
r = r - j * 0x1.6a8880p-17f; // 1.08043314e-05 // pio2_mid
r = r - j * 0x1.68c234p-39f; // 2.56334407e-12 // pio2_low
#endif // CW_STAGES
#endif // USE_FMA
*q = (int)j;
}
return r;
}
/* Approximate sine on [-PI/4,+PI/4]. Maximum ulp error with USE_FMA = 0.64196
Returns -0.0f for an argument of -0.0f
Polynomial approximation based on T. Myklebust,"Computing accurate
Horner form approximations to special functions in finite precision
arithmetic",http://arxiv.org/abs/1508.03211,retrieved on 8/29/2016
*/
float sinf_poly (float a,float s)
{
float r,t;
#if USE_FMA
r = 0x1.80a000p-19f; // 2.86567956e-6
r = fmaf (r,s,-0x1.a0690cp-13f); // -1.98559923e-4
r = fmaf (r,0x1.111182p-07f); // 8.33338592e-3
r = fmaf (r,-0x1.555556p-03f); // -1.66666672e-1
t = fmaf (a,0.0f); // ensure -0 is passed through
r = fmaf (r,t,a);
#else // USE_FMA
r = 0x1.80a000p-19f; // 2.86567956e-6
r = r * s - 0x1.a0690cp-13f; // -1.98559923e-4
r = r * s + 0x1.111182p-07f; // 8.33338592e-3
r = r * s - 0x1.555556p-03f; // -1.66666672e-1
t = a * s + 0.0f; // ensure -0 is passed through
r = r * t + a;
#endif // USE_FMA
return r;
}
/* Approximate cosine on [-PI/4,+PI/4]. Maximum ulp error with USE_FMA = 0.87444 */
float cosf_poly (float s)
{
float r;
#if USE_FMA
r = 0x1.9a8000p-16f; // 2.44677067e-5
r = fmaf (r,-0x1.6c0efap-10f); // -1.38877297e-3
r = fmaf (r,0x1.555550p-05f); // 4.16666567e-2
r = fmaf (r,-0x1.000000p-01f); // -5.00000000e-1
r = fmaf (r,0x1.000000p+00f); // 1.00000000e+0
#else // USE_FMA
r = 0x1.9a8000p-16f; // 2.44677067e-5
r = r * s - 0x1.6c0efap-10f; // -1.38877297e-3
r = r * s + 0x1.555550p-05f; // 4.16666567e-2
r = r * s - 0x1.000000p-01f; // -5.00000000e-1
r = r * s + 0x1.000000p+00f; // 1.00000000e+0
#endif // USE_FMA
return r;
}
/* Map sine or cosine value based on quadrant */
float sinf_cosf_core (float a,int i)
{
float r,s;
s = a * a;
r = (i & 1) ? cosf_poly (s) : sinf_poly (a,s);
if (i & 2) {
r = 0.0f - r; // don't change "sign" of NaNs
}
return r;
}
/* maximum ulp error with USE_FMA = 1: 1.495098 */
float my_sinf (float a)
{
float r;
int i;
a = a * 0.0f + a; // inf -> NaN
r = trig_red_f (a,SIN_RED_SWITCHOVER,&i);
r = sinf_cosf_core (r,i);
return r;
}
/* maximum ulp error with USE_FMA = 1: 1.493253 */
float my_cosf (float a)
{
float r;
int i;
a = a * 0.0f + a; // inf -> NaN
r = trig_red_f (a,COS_RED_SWITCHOVER,i + 1);
return r;
}
/* re-interpret bit pattern of an IEEE-754 double as a uint64 */
uint64_t double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r,sizeof r);
return r;
}
double floatUlpErr (float res,double ref)
{
uint64_t i,j,err,refi;
int expoRef;
/* ulp error cannot be computed if either operand is NaN,infinity,zero */
if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
(res == 0.0f) || (ref == 0.0f)) {
return 0.0;
}
/* Convert the float result to an "extended float". This is like a float
with 56 instead of 24 effective mantissa bits.
*/
i = ((uint64_t)float_as_uint32(res)) << 32;
/* Convert the double reference to an "extended float". If the reference is
>= 2^129,we need to clamp to the maximum "extended float". If reference
is < 2^-126,we need to denormalize because of the float types's limited
exponent range.
*/
refi = double_as_uint64(ref);
expoRef = (int)(((refi >> 52) & 0x7ff) - 1023);
if (expoRef >= 129) {
j = 0x7fffffffffffffffULL;
} else if (expoRef < -126) {
j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
j = j >> (-(expoRef + 126));
} else {
j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
j = j | ((uint64_t)(expoRef + 127) << 55);
}
j = j | (refi & 0x8000000000000000ULL);
err = (i < j) ? (j - i) : (i - j);
return err / 4294967296.0;
}
int main (void)
{
float arg,res,reff;
uint32_t argi,resi,refi;
int64_t diff,diffsum;
double ref,ulp,maxulp;
printf ("Testing sinf ... ");
diffsum = 0;
maxulp = 0;
argi = 0;
do {
arg = uint32_as_float (argi);
res = my_sinf (arg);
ref = sin ((double)arg);
reff = (float)ref;
resi = float_as_uint32 (res);
refi = float_as_uint32 (reff);
ulp = floatUlpErr (res,ref);
if (ulp > maxulp) {
maxulp = ulp;
}
#if DEBUGGING
printf (" %08x (% 15.8e): res=%08x (% 15.8e) ref=%016llx (% 23.16e)\n",argi,arg,double_as_uint64 (ref),ref);
#endif // DEBUGGING
diff = llabs ((int64_t)resi - (int64_t)refi);
if (diff > MAX_DIFF) {
printf ("\nerror @ %08x (% 15.8e): res=%08x (% 15.8e) ref=%08x (%15.8e)\n",refi,reff);
return EXIT_FAILURE;
}
diffsum = diffsum + diff;
argi++;
} while (argi);
printf ("PASSED. max ulp err = %.6f diffsum = %lld\n",maxulp,diffsum);
printf ("Testing cosf ... ");
diffsum = 0;
maxulp = 0;
argi = 0;
do {
arg = uint32_as_float (argi);
res = my_cosf (arg);
ref = cos ((double)arg);
reff = (float)ref;
resi = float_as_uint32 (res);
refi = float_as_uint32 (reff);
ulp = floatUlpErr (res,ref);
#endif // DEBUGGING
diff = llabs ((int64_t)resi - (int64_t)refi);
if (diff > MAX_DIFF) {
printf ("\nerror @ %08x (% 15.8e): res=%08x (% 15.8e) ref=%08x (%15.8e)\n",reff);
return EXIT_FAILURE;
}
diffsum = diffsum + diff;
argi++;
} while (argi);
diffsum = diffsum + diff;
printf ("PASSED. max ulp err = %.6f diffsum = %lld\n",diffsum);
return EXIT_SUCCESS;
}