How to cluster infrared spectroscopy data with Python

Question

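I have been looking into clustering infrared spectroscopy data with the clustering methods in sklearn. I am having trouble getting the clustering to work on my data, and since I am new to this, I can't tell whether my code is wrong or my approach is wrong.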

My data is in a Pandas DataFrame that looks like this:

Index     Wavenumbers (cm-1)     %Transmission_i   ...
0         650                    100               ... 
.          .                      .                ...
.          .                      .                ...
.          .                      .                ...
n         4000                   95                ...

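Here the Wavenumbers (cm-1) column is the x-axis shared by all spectra, and the columns that follow (%Transmission_i) hold the actual data. I want to cluster these columns (in terms of which spectra are most similar to one another), so I am trying this code: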

X        = np.array([list(df[x].values) for x in df.set_index(x)])
clusters = DBSCAN().fit(X)

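where df is my DataFrame and np is numpy (obviously). The problem is that when I print the cluster labels, I get nothing but -1, which means all of my data is treated as noise. That is not the case: when I plot the data, I can clearly see that some spectra look very similar (as they should).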

How can I get similar spectra to cluster correctly?

Edit: here is a minimal working example.

import numpy as np
import pandas as pd
import sklearn as sk
import sklearn.preprocessing  # make sure sk.preprocessing is loaded
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

x = 'x-vals'

def cluster_data(df):

    avg_list = []
    dif_list = []
    for col in df:
        if x == col:
            continue
        avg_list.append(np.mean(df[col].values))
        dif_list.append(np.mean(np.diff(df[col].values)))

    a = sk.preprocessing.normalize([avg_list],norm='max')[0]
    b = sk.preprocessing.normalize([dif_list],norm='max')[0]

    X = []
    for i,j in zip(a,b):
        X.append([i,j])

    X = np.array(X)
    clusters = DBSCAN(eps=0.2).fit(X)

    return clusters.labels_

def plot_clusters(df,clusters):
    colors = ['red','green','blue','black','pink']
    i      = 0
    for col in df:
        if col == x:
            continue
        color = colors[clusters[i]]
        plt.plot(df[x],df[col],color=color)
        i +=1
    plt.show()


x1  = np.linspace(-np.pi,np.pi,201)
y1  = np.sin(x1) + 1
y2  = np.cos(x1) + 1
y3  = np.zeros_like(x1) + 2
y4  = np.zeros_like(x1) + 1.9
y5  = np.zeros_like(x1) + 1.8
y6  = np.zeros_like(x1) + 1.7
y7  = np.zeros_like(x1) + 1
y8  = np.zeros_like(x1) + 0.9
y9  = np.zeros_like(x1) + 0.8
y10 = np.zeros_like(x1) + 0.7

df  = pd.DataFrame({'x-vals':x1,'y1':y1,'y2':y2,'y3':y3,'y4':y4,'y5':y5,'y6':y6,'y7':y7,'y8':y8,'y9':y9,'y10':y10})

clusters = cluster_data(df)

plot_clusters(df,clusters)

This produces the following plot, where red is a cluster and pink is noise.

Solution

I was able to find an approach that works, but I am not fully convinced it is the best way to cluster infrared spectra.

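First, I loop over all spectra and build two lists: the mean of each spectrum and the mean of its first derivative. The mean is meant to represent the vertical position of the spectrum, and the mean of the first derivative is meant to represent its shape.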

avg_list = []
dif_list = []
for col in df:
    if x == col:  # skip the shared x-axis column
        continue
    avg_list.append(np.mean(df[col].values))
    dif_list.append(np.mean(np.diff(df[col].values)))
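
One caveat that may be worth checking before relying on the second feature: the mean of np.diff telescopes to (last value - first value) / (n - 1), so it depends only on the two endpoints of a spectrum. The sketch below uses two made-up curves (y_flat and y_peak are illustrative names, not part of the data above) with clearly different shapes that end up with essentially the same value.

```python
import numpy as np

x_demo = np.linspace(-np.pi, np.pi, 201)
y_flat = np.ones_like(x_demo)   # a flat line
y_peak = np.cos(x_demo) + 1     # one broad peak that starts and ends at 0

mean_diff_flat = np.mean(np.diff(y_flat))
mean_diff_peak = np.mean(np.diff(y_peak))

# The sum of consecutive differences collapses to last - first
endpoint_slope = (y_peak[-1] - y_peak[0]) / (len(y_peak) - 1)

print(mean_diff_flat, mean_diff_peak, endpoint_slope)
```

Both means come out at zero (up to floating-point rounding), so in a case like this the feature would not separate the two shapes. Something like the mean absolute difference might capture shape better, but that is a suggestion, not something tested on real spectra.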

Then I normalize each list so that I can choose the eps value as a percentage difference.

a = sk.preprocessing.normalize([avg_list],norm='max')[0]
b = sk.preprocessing.normalize([dif_list],norm='max')[0]
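
For reference, here is a small sketch of what normalize with norm='max' does to a single list. The numbers in avg_demo are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import normalize

avg_demo = [0.5, 1.0, 2.0]  # made-up per-spectrum means

# normalize works on rows, so the list is wrapped in another list;
# norm='max' divides the row by its largest absolute value
a_demo = normalize([avg_demo], norm='max')[0]

print(a_demo)  # values 0.25, 0.5 and 1.0
```

After this scaling the largest entry is 1, so an eps of 0.2 corresponds roughly to 20% of the largest value in each list.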

After that, I build a 2D array so that DBSCAN runs on two features.

X = []
for i,j in zip(a,b):
    X.append([i,j])

Then I run the DBSCAN clustering method, using an arbitrary percentage difference as the eps parameter.

X        = np.array(X)
clusters = DBSCAN(eps=0.2).fit(X)

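clusters.labels_ then returns an array with the same length as the number of spectra in my DataFrame. It works quite well, but it is rather exclusive, and the clustering could be better. Some finer tuning would help.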
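
On the tuning point, one common heuristic for picking eps (not something the approach above used) is to look at the sorted distance from each point to its k-th nearest neighbor and choose eps near where that curve rises sharply. This is a minimal sketch under stated assumptions: X_demo is made-up 2D data standing in for X, and k is assumed to match DBSCAN's min_samples.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up 2D features standing in for X: two tight groups of 20 points each
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.02, size=(20, 2)),
    rng.normal(1.0, 0.02, size=(20, 2)),
])

k = 5  # assumed to equal DBSCAN's min_samples (its default is 5)
nn = NearestNeighbors(n_neighbors=k).fit(X_demo)

# Querying with the fitted data counts each point as its own nearest neighbor,
# which matches how min_samples counts the point itself
distances, _ = nn.kneighbors(X_demo)

# Sorted distance to the k-th neighbor; plot this and look for the sharp rise
k_dist = np.sort(distances[:, -1])
print(k_dist)
```

With only ten spectra, as in the example in the question, there are too few points for this curve to say much, so it is more likely to help on a larger set of spectra.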

Alternatively, first transpose the DataFrame so that the data points are rows, as is standard. It should look like this:

Index     650     660     ...     4000
0         100     98      ...     95
1         .       .       ...     .
.         .       .       ...     .
n         .       .       ...     .

Then you get the data to cluster like so:

X = df.values

Next, you run the clustering.
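
As a rough sketch of the transposed layout, the code below uses the synthetic spectra from the question's example (the 'x-vals' column name comes from there) and runs DBSCAN on it. The eps and min_samples values are assumptions tuned to this toy data, not recommendations for real spectra. It also hints at a plausible reason for the all -1 result in the question: with one row per spectrum, two flat spectra that differ by 0.1 at each of 201 points are 0.1 * sqrt(201), about 1.42, apart, which is already more than the default eps of 0.5.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Same synthetic spectra as in the question's example
x1 = np.linspace(-np.pi, np.pi, 201)
data = {'x-vals': x1, 'y1': np.sin(x1) + 1, 'y2': np.cos(x1) + 1}
levels = [2.0, 1.9, 1.8, 1.7, 1.0, 0.9, 0.8, 0.7]
for i, level in enumerate(levels, start=3):
    data[f'y{i}'] = np.full_like(x1, level)
df = pd.DataFrame(data)

# One row per spectrum, one column per x value
spectra = df.set_index('x-vals').T
X = spectra.values  # shape (10, 201)

# Defaults (eps=0.5, min_samples=5): every spectrum is labeled as noise
default_labels = DBSCAN().fit(X).labels_

# Assumed values for this toy data: eps just above 1.42, small min_samples
tuned_labels = DBSCAN(eps=1.5, min_samples=3).fit(X).labels_

print(dict(zip(spectra.index, default_labels)))
print(dict(zip(spectra.index, tuned_labels)))
```

With the tuned values, the flat spectra near 2 form one cluster, the flat spectra near 1 form another, and the sine and cosine curves are left as noise. Real %Transmission data on a 0 to 100 scale with many more points would have much larger distances, so eps would need to be rescaled accordingly.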

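As a suggestion for spectral data, k-means (downside: you need to set the number of clusters beforehand) and self-organizing maps (downside: soft clustering instead of hard clustering) can work well. For instance, you can find an example here of clustering on hyperspectral data.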
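
A minimal k-means sketch on the same synthetic spectra from the question might look like this. The choice of n_clusters=4 is an assumption that suits this toy data (two flat groups plus the sine and cosine curves); for real spectra the number of clusters has to be chosen by you, which is the downside mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Same synthetic spectra as in the question's example
x1 = np.linspace(-np.pi, np.pi, 201)
data = {'x-vals': x1, 'y1': np.sin(x1) + 1, 'y2': np.cos(x1) + 1}
levels = [2.0, 1.9, 1.8, 1.7, 1.0, 0.9, 0.8, 0.7]
for i, level in enumerate(levels, start=3):
    data[f'y{i}'] = np.full_like(x1, level)
df = pd.DataFrame(data)

# One row per spectrum
spectra = df.set_index('x-vals').T
X = spectra.values

# n_clusters=4 is an assumption for this toy data; random_state fixes the run
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

print(dict(zip(spectra.index, labels)))
```

Unlike DBSCAN, k-means assigns every spectrum to a cluster, so there is no noise label here. The label numbers themselves are arbitrary; only which spectra share a label matters.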

cluster-analysis dbscan python-3.x scikit-learn