Radix sort algorithm explanation

Question

I am new to programming. I was looking for a radix sort implementation in C++ and found this code:

void countSort(string a[],int size,size_t k)
{
    string *b = NULL; int *c = NULL;
    b = new string[size];
    c = new int[257];

    for (int i = 0; i <257; i++){
        c[i] = 0;   
    }

    for (int j = 0; j <size; j++){   
        c[k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0]++;
        //a[j] is a string
    }

    for (int f = 1; f <257; f++){
        c[f] += c[f - 1];
    }

    for (int r = size - 1; r >= 0; r--){
        b[c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0] - 1] = a[r];
        c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0]--;
    }

    for (int l = 0; l < size; l++){
        a[l] = b[l];
    }

    // avoid memory leak
    delete[] b;
    delete[] c;
}
void radixSort(string b[],int r)
{
    size_t max = getMax(b,r);
    for (size_t digit = max; digit > 0; digit--){ 
        countSort(b,r,digit - 1);
    }
}

So my question is what these lines do:

c[k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0]++;
b[c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0] - 1] = a[r];
c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0]--;

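And is that an MSD or an LSD radix sort?

Thanks.

Answer

This is a fine example of unnecessarily compact code, which makes it hard to read.

To analyze it, it helps to take it apart: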

// what a mess...
c[k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0]++;

First, pull out the subscript argument of c:

// determine index for c
const int iC = k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0;
// post-increment c[iC] (the result is unused, so a pre-increment would do as well)
c[iC]++;

The index computation contains a condition:

// determine index for c
const int iC
  // check that k does not exceed the size of a[j]
  = k < a[j].size()
  // then 
  ? (int)(unsigned char)a[j][k] + 1
  // else
  : 0;

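a is an array of std::string, and each std::string in turn contains an array of char. Hence, a[j][k] yields a single char. A char may be signed or unsigned; that is left to the compiler. So, (unsigned char)a[j][k] does not change the bits of that char but interprets them as an unsigned number. Then, (int)(unsigned char)a[j][k] promotes this to int.

Please note that this may differ from (int)a[j][k] if the current compiler has signed chars, because in that case the possible sign of the value would be kept. (This is called sign extension.) So the whole expression is just responsible for converting the current character into a non-negative index, to which 1 is finally added. Index 0 is reserved for strings that have no character at position k.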
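The effect of the two casts can be tried in isolation. The following is a minimal sketch; the helper name keyAt is made up here and does not appear in the question's code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the question's code):
// bucket index for the character of s at position k.
int keyAt(const std::string &s, std::size_t k)
{
  if (k >= s.size()) return 0;                  // no character there: bucket 0
  const unsigned char uc = (unsigned char)s[k]; // same bits, value in [0, 255]
  return (int)uc + 1;                           // shifted into [1, 256]
}

// What happens without the unsigned char cast:
// the result is negative for bytes >= 0x80 if char is signed.
int plainCast(const std::string &s, std::size_t k)
{
  return (int)s[k];
}
```

For a string holding the single byte 0xE9, keyAt(s, 0) is 234 on every platform, while plainCast(s, 0) is -23 where char is signed and 233 where it is unsigned. A negative value would be useless as an array index, which is why the cast is there.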


Actually, I intended to leave the rest as an exercise to the reader, but then I saw this:

b[c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0] - 1] = a[r];

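Separating it like above yields:

const int iC = k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0;
const int iB = c[iC] - 1; // the - 1 applies to the value read from c, not to iC
b[iB] = a[r];

At first glance, I read the - 1 as part of the subscript, i.e. c[iC - 1]. Considering that iC may be 0, that would access c[-1].

A negative index can be correct, e.g. if c is a pointer into a bigger array but not to its beginning, so that the negative index accesses valid storage. This doesn't seem to be the case here:

c = new int[257];

and I couldn't see any other assignment to c.

Counting the brackets shows that c is only ever subscripted with iC, which is in [0, 256]. After the loop that accumulates the counters, c[iC] is at least 1 for every iC that still has strings to place, so iB is not negative either. Still, an expression that needs bracket counting to rule out an out-of-bounds access doesn't look trustworthy.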


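I'm quite sure I was able to demonstrate that compact code doesn't make it any easier to spot possible issues in it.

So, would non-compact code be slower? In my experience, it is not, given the amazing optimization capabilities of modern compilers.

I once read an article about optimization and static single assignment form. Also, when debugging C++ code, I occasionally see all those funny $$ variables in the watch window of the Visual Studio debugger (although the code definitely doesn't contain any variable named $$). So I believe the compiler does something similar internally anyway. Doing this explicitly for the sake of readability should not have the least impact on performance.

If in real doubt, I can still check the assembler output. (Compiler Explorer is a good place for this, for example.)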
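To check this for the line in question, both variants can be pasted into Compiler Explorer and compared with optimization enabled (e.g. -O2). The following is a sketch for that purpose; the function names are made up, and I'm not quoting any assembler output here.

```cpp
#include <cstddef>
#include <string>

// compact form, as in the question
void countCompact(int c[], const std::string &s, std::size_t k)
{
  c[k < s.size() ? (int)(unsigned char)s[k] + 1 : 0]++;
}

// the same computation with named intermediate values
void countReadable(int c[], const std::string &s, std::size_t k)
{
  const bool hasChar = k < s.size();
  const int iC = hasChar ? (int)(unsigned char)s[k] + 1 : 0;
  ++c[iC];
}
```

Both functions update the same counter for every input, so any difference in the generated code would come from the compiler alone, not from the logic.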


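By the way: c = new int[257];

Why not int c[257];?

257 int values are not so many that I would be afraid of exceeding the stack size right away.

Not to mention that arrays, and especially arrays allocated with new, are really bad C++ style that is asking for U.B. (undefined behavior). As if std::vector had not been invented yet.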
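As a sketch of what this could look like while keeping the signature from the question (so that radixSort from the question could call it unchanged): the counters live in a std::array, the buffer in a std::vector, and there is nothing left to delete. The lambda name keyAt is made up here, and size is assumed to be non-negative.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Same signature as countSort in the question, but without new/delete.
void countSort(std::string a[], int size, std::size_t k)
{
  std::array<int, 257> c{};         // value-initialized: all counters are 0
  std::vector<std::string> b(size); // released automatically at scope exit

  // 0 = no character at position k, 1..256 = character value + 1
  auto keyAt = [k](const std::string &s) {
    return k < s.size() ? (int)(unsigned char)s[k] + 1 : 0;
  };

  for (int j = 0; j < size; j++) ++c[keyAt(a[j])]; // count the keys
  for (int f = 1; f < 257; f++) c[f] += c[f - 1];  // running totals
  for (int r = size - 1; r >= 0; r--) {            // backwards: keeps the pass stable
    const int iC = keyAt(a[r]);
    const int iB = --c[iC];                        // last free slot of this bucket
    b[iB] = a[r];
  }
  for (int l = 0; l < size; l++) a[l] = std::move(b[l]); // write back
}
```

Called for character positions 4 down to 0 on { "Hi", "He", "Hello", "World", "Wide", "Web" }, this leaves the array sorted by character values.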


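When I was a student, I somehow missed the lesson about radix sort (though I have to admit that I haven't missed this knowledge in my daily business yet). So, out of curiosity, I had a look at Wikipedia and re-implemented what is described there. This is intended as a (hopefully better) replacement for what the OP found and exposed in the question.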

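Thus, I implemented

  1. the naive approach according to the description on en.wikipedia.org: Radix sort – History
  2. and then the approach shown by the OP (with counting sort), which I found on de.wikipedia.org: Countingsort – Algorithmus.
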
#include <cstddef>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

/* helper to find max. length in data strings
 */
size_t maxLength(const std::vector<std::string> &data)
{
  size_t lenMax = 0;
  for (const std::string &value : data) {
    if (lenMax < value.size()) lenMax = value.size();
  }
  return lenMax;
}

/* a naive implementation of radix sort
 * like described in https://en.wikipedia.org/wiki/Radix_sort
 */
void radixSort(std::vector<std::string> &data)
{
  /* A char has 8 bits - which encode (unsigned) the numbers of [0, 255].
   * Hence, 256 buckets are used for sorting.
   */
  std::vector<std::string> buckets[256];
  // determine max. length of input data:
  const size_t len = maxLength(data);
  /* iterate over data according to max. length
   */
  for (size_t i = len; i--;) { // i-- -> check for 0 and post-decrement
    // sort data into buckets according to the current "digit":
    for (std::string &value : data) {
      /* digits after end of string are considered as '\0'
       * because 0 is the usual end-marker of C strings
       * and the least possible value of an unsigned char.
       * This shall ensure that a string goes before a longer
       * string with same prefix.
       */
      const unsigned char digit = i < value.size() ? value[i] : '\0';
      // move current string into the corresponding bucket
      buckets[digit].push_back(std::move(value));
    }
    // store buckets back into data (preserving current order)
    data.clear();
    for (std::vector<std::string> &bucket : buckets) {
      // append bucket to the data
      data.insert(data.end(),std::make_move_iterator(bucket.begin()),std::make_move_iterator(bucket.end()));
      bucket.clear();
    }
  }
}

/* counting sort as helper for the not so naive radix sort
 */
void countSort(std::vector<std::string> &data,size_t i)
{
  /* There are 256 possible values for an unsigned char
   * (which may have a value in [0,255]).
   */
  size_t counts[256] = { 0 }; // initialize all counters with 0.
  // count how often a certain character appears at the place i
  for (const std::string &value : data) {
    /* digits after end of string are considered as '\0'
     * because 0 is the usual end-marker of C strings
     * and the least possible value of an unsigned char.
     * This shall ensure that a string goes before a longer
     * string with same prefix.
     */
    const unsigned char digit = i < value.size() ? value[i] : '\0';
    // count the resp. bucket counter
    ++counts[digit];
  }
  // turn counts of digits into offsets in data
  size_t total = 0;
  for (size_t &count : counts) {
#if 0 // could be compact (and, maybe, confusing):
    total = count += total; // as C++ assignment is right-associative
#else // but is the same as:
    count += total; // add previous total sum to count
    total = count; // remember new total
#endif // 0
  }
  // an auxiliary buffer to sort the input data into.
  std::vector<std::string> buffer(data.size());
  /* Move input into aux. buffer
   * while using the bucket offsets (the former counts)
   * for addressing of new positions.
   * This is done backwards intentionally as the offsets
   * are decremented from end to begin of partitions.
   */
  for (size_t j = data.size(); j--;) { // j-- -> check for 0 and post-decrement
    std::string &value = data[j];
    // see comment for digit above...
    const unsigned char digit = i < value.size() ? value[i] : '\0';
    /* decrement offset and use as index
     * Arrays (and vectors) in C++ are 0-based.
     * Hence, this is adjusted respectively (compared to the source of algorithm).
     */
    const size_t k = --counts[digit];
    // move input element into auxiliary buffer at the determined offset
    buffer[k] = std::move(value);
  }
  /* That's it.
   * Move aux. buffer back into data.
   */
  data = std::move(buffer);
}

/* radix sort using count sort internally
 */
void radixCountSort(std::vector<std::string> &data)
{
  // determine max. length of input data:
  const size_t len = maxLength(data);
  /* iterate over data according to max. length
   */
  for (size_t i = len; i--;) { // i-- -> check for 0 and post-decrement
    countSort(data,i);
  }
}

/* output of vector with strings
 */
std::ostream& operator<<(std::ostream &out,const std::vector<std::string> &data)
{
  const char *sep = " ";
  for (const std::string &value : data) {
    out << sep << '"' << value << '"';
    sep = ",";
  }
  return out;
}

/* do a test for certain data
 */
void test(const std::vector<std::string> &data)
{
  std::cout << "Data: {" << data << " }\n";
  std::vector<std::string> data1 = data;
  radixSort(data1);
  std::cout << "Radix Sorted:       {" << data1 << " }\n";
  std::vector<std::string> data2 = data;
  radixCountSort(data2);
  std::cout << "Radix Count Sorted: {" << data2 << " }\n";
}

/* helper to turn a text into a vector of strings
 * (by separating at white spaces)
 */
std::vector<std::string> tokenize(const char *text)
{
  std::istringstream in(text);
  std::vector<std::string> tokens;
  for (std::string token; in >> token;) tokens.push_back(token);
  return tokens;
}

/* main program
 */
int main()
{
  // do some tests:
  test({ "Hi","He","Hello","World","Wide","Web" });
  test({ });
  test(
    tokenize(
      "Radix sort dates back as far as 1887 to the work of Herman Hollerith on tabulating machines.\n"
      "Radix sorting algorithms came into common use as a way to sort punched cards as early as 1923.\n"
      "The first memory-efficient computer algorithm was developed in 1954 at MIT by Harold H. Seward.\n"
      "Computerized radix sorts had previously been dismissed as impractical "
      "because of the perceived need for variable allocation of buckets of unknown size.\n"
      "Seward's innovation was to use a linear scan to determine the required bucket sizes and offsets beforehand, "
      "allowing for a single static allocation of auxiliary memory.\n"
      "The linear scan is closely related to Seward's other algorithm - counting sort."));
}

Output:

Data: { "Hi","He","Hello","World","Wide","Web" }
Radix Sorted:       { "He","Hello","Hi","Web","Wide","World" }
Radix Count Sorted: { "He","Hello","Hi","Web","Wide","World" }
Data: { }
Radix Sorted:       { }
Radix Count Sorted: { }
Data: { "Radix","sort","dates","back","as","far","1887","to","the","work","of","Herman","Hollerith","on","tabulating","machines.","Radix","sorting","algorithms","came","into","common","use","a","way","punched","cards","early","1923.","The","first","memory-efficient","computer","algorithm","was","developed","in","1954","at","MIT","by","Harold","H.","Seward.","Computerized","radix","sorts","had","previously","been","dismissed","impractical","because","perceived","need","for","variable","allocation","buckets","unknown","size.","Seward's","innovation","linear","scan","determine","required","bucket","sizes","and","offsets","beforehand,","allowing","single","static","auxiliary","memory.","is","closely","related","other","-","counting","sort." }
Radix Sorted:       { "-","sort.","work" }
Radix Count Sorted: { "-","work" }

Live Demo on coliru

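Please note that the strings are sorted by interpreting the numeric values of the characters. If an English dictionary ordering is intended instead, the mapping of digits to buckets has to be modified. In that mapping, the order of the character values may be changed, and corresponding uppercase and lowercase characters may be mapped to the same bucket.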
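A minimal sketch of such a mapping follows. It only folds case and leaves everything else in character value order; a real dictionary order would need a complete translation table. The function name dictionaryDigit is made up; it would take the place of the computation of digit in the code above.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical replacement for the computation of "digit":
// upper and lower case letters end up in the same bucket.
unsigned char dictionaryDigit(const std::string &value, std::size_t i)
{
  if (i >= value.size()) return 0; // past the end: sorts before any character
  const unsigned char c = (unsigned char)value[i];
  // std::tolower maps 'A'..'Z' to 'a'..'z' in the default "C" locale
  return (unsigned char)std::tolower(c);
}
```

As both sorts above are stable, strings that differ only in case (e.g. "web" and "Web") would keep their input order relative to each other.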

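Frequently copying strings (or other containers) costs space and time, and it is something I would rather avoid in productive code. Move semantics is one option to lower the stress on the CPU while keeping the code clean and comparable to the algorithm behind it. This is what I tried to consider (to the best of my knowledge) in the sample code.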
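Coming back to the remaining part of the question, MSD or LSD: radixSort in the question starts with digit = max and counts down, so the last character position is sorted first and the first character position last. That is the LSD (least significant digit) order, and the two re-implementations above do the same. The following sketch shows only this pass order; std::stable_sort stands in for the counting sort pass (an assumption to keep it short), and the names lsdRadixSort and digitAt are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the pass order only, not a replacement for counting sort.
std::vector<std::string> lsdRadixSort(std::vector<std::string> data)
{
  std::size_t len = 0;
  for (const std::string &s : data) len = std::max(len, s.size());
  // digit at position i, 0 if the string is too short
  auto digitAt = [](const std::string &s, std::size_t i) -> int {
    return i < s.size() ? (int)(unsigned char)s[i] + 1 : 0;
  };
  // LSD: start at the LAST position and move towards the first
  for (std::size_t i = len; i--;) {
    std::stable_sort(data.begin(), data.end(),
      [&](const std::string &x, const std::string &y) {
        return digitAt(x, i) < digitAt(y, i);
      });
  }
  return data;
}
```

This only works because every pass is stable: the order established by the later character positions survives whenever two strings agree at the current position. An MSD variant would start at position 0 instead and recurse into each bucket.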