Sample names are wrong when loading microarray data with GEOquery

Problem description

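I am trying to load microarray data from GEO with GEOquery for analysis. When I run the following code, the row containing the sample names is skipped, and the first data row of expression values is used as the header instead. Can you help me solve this? Thanks.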

library(GEOquery)    
gset <- getGEO("GSE1729", GSEMatrix = TRUE)
if (length(gset) > 1) idx <- grep("GPL96", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]   
gset

Output

Parsed with column specification:
cols(
  .default = col_double(),
  `1007_s_at` = col_character()
)

See spec(...) for full column specifications.
|=================================================================================| 100%    4 MB
Warning: 68 parsing failures.  
  row     col           expected    actual         file  
22216 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data   
22217 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data  
22218 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data  
22219 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data  
22220 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data  
..... ....... .................. ......... ............  
See problems(...) for more details.  

ExpressionSet (storageMode: lockedEnvironment)     
assayData: 22282 features,43 samples     
  element names: exprs    
protocolData: none   
phenoData  
  sampleNames: 71 55.4 ... 84.8 (43 total)
  varLabels: title geo_accession ... data_row_count (26 total)  
  varMetadata: labelDescription  
featureData  
  featureNames: 1053_at 117_at ... AFFX-TrpnX-M_at (22282 total)  
  fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)  
  fvarMetadata: Column Description labelDescription  
experimentData: use 'experimentData(object)'  
  pubMedIds: 15674361   
Annotation: GPL96   


> sessionInfo()  
R version 4.0.2 (2020-06-22)  
Platform: x86_64-w64-mingw32/x64 (64-bit)  
Running under: Windows 10 x64 (build 18363)  

Matrix products: default  

locale:  
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252     
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                            
[5] LC_TIME=English_United States.1252      

attached base packages:  
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:  
[1] GEOquery_2.56.0     Biobase_2.48.0      BiocGenerics_0.34.0

loaded via a namespace (and not attached):  
 [1] Rcpp_1.0.5       tidyr_1.1.1      crayon_1.3.4     dplyr_1.0.1      R6_2.4.1
 [6] lifecycle_0.2.0  magrittr_1.5     pillar_1.4.6     rlang_0.4.7      curl_4.3             
[11] rstudioapi_0.11  limma_3.44.3     xml2_1.3.2       vctrs_0.3.2      generics_0.0.2      
[16] ellipsis_0.3.1   tools_4.0.2      readr_1.3.1      glue_1.4.1       purrr_0.3.4         
[21] hms_0.5.3        compiler_4.0.2   pkgconfig_2.0.3  tidyselect_1.1.0 tibble_3.0.3      

Solution

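Line 8 of the file (ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11121/matrix/GSE11121_series_matrix.txt.gz) appears to be corrupted: its second, quoted field contains a control character, which prevents the file from being parsed correctly: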

!Series_title   "Gene expression profile of acute myeloid leukemia"
!Series_geo_accession   "GSE1729"
!Series_status  "Public on Jan 26 2005"
!Series_submission_date "Sep 06 2004"
!Series_last_update_date        "Aug 10 2018"
!Series_pubmed_id       "15674361"
!Series_summary "Gene expression profile of acute myeloid leukemia."
!Series_summary "^M"
...
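
To check a downloaded copy for this kind of character, one option is to scan the raw bytes, because R's readLines() treats a lone carriage return as a line ending and would hide it. The following is a minimal sketch, assuming the ^M shown above denotes a carriage return (byte 0x0D) that is not part of a CRLF line ending. find_stray_cr() is a hypothetical helper, not part of GEOquery, and the demo runs on a small temporary file rather than on the real series matrix.

```r
# Hypothetical helper: report the line numbers that contain a stray carriage
# return (a CR that is not immediately followed by LF).
find_stray_cr <- function(path) {
  # gzfile() reads both gzip-compressed and plain files
  con <- gzfile(path, open = "rb")
  on.exit(close(con))

  # Read the whole file as raw bytes, in chunks
  bytes <- raw(0)
  repeat {
    chunk <- readBin(con, what = "raw", n = 1e6)
    if (length(chunk) == 0) break
    bytes <- c(bytes, chunk)
  }

  is_cr <- bytes == as.raw(0x0d)
  is_lf <- bytes == as.raw(0x0a)

  # A CR directly followed by LF is an ordinary Windows line ending
  next_is_lf <- c(is_lf[-1], FALSE)
  stray <- which(is_cr & !next_is_lf)

  # Line number = number of LFs before the stray CR, plus one
  cumsum(is_lf)[stray] + 1L
}

# Demo on a made-up three-line file: line 2 has a stray CR inside the quotes,
# line 3 ends with a normal CRLF and should not be reported.
demo_file <- tempfile(fileext = ".txt")
writeBin(
  charToRaw("!Series_title\t\"x\"\n!Series_summary\t\"\r\"\nID_REF\tGSM1\r\n"),
  demo_file
)
find_stray_cr(demo_file)
```

On the demo file this should report line 2 only. Whether the real file reports line 8 depends on the assumption about what ^M stands for.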

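If you download and decompress the file (for example with 7-Zip on Windows), open it in an editor, delete the offending entry, and save it again, you can then read the locally modified copy and get the correct sample names (there is no need to re-compress it).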

gset <- getGEO(filename = "GSE1729_series_matrix.txt", GSEMatrix = TRUE, parseCharacteristics = TRUE)

## check:
sampleNames(gset)
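
The manual edit can also be scripted. The sketch below is one possible way to do it, assuming the problematic control character is a carriage return (byte 0x0D); it drops every such byte from the file, which also converts any CRLF line endings to LF. strip_carriage_returns() is a hypothetical helper, not a GEOquery function, and this has not been verified against the live GEO file.

```r
# Hypothetical helper: copy `infile` (gzip-compressed or plain) to `outfile`
# as plain text with every carriage-return byte removed.
strip_carriage_returns <- function(infile, outfile) {
  con <- gzfile(infile, open = "rb")
  on.exit(close(con))

  # Read the whole file as raw bytes, in chunks
  bytes <- raw(0)
  repeat {
    chunk <- readBin(con, what = "raw", n = 1e6)
    if (length(chunk) == 0) break
    bytes <- c(bytes, chunk)
  }

  # Keep everything except CR (0x0D) and write the result uncompressed
  writeBin(bytes[bytes != as.raw(0x0d)], outfile)
  invisible(outfile)
}

# Demo on a made-up two-line file with a CR inside the quoted field
demo_in  <- tempfile(fileext = ".txt")
demo_out <- tempfile(fileext = ".txt")
writeBin(charToRaw("!Series_summary\t\"\r\"\nID_REF\tGSM1\n"), demo_in)
strip_carriage_returns(demo_in, demo_out)
readLines(demo_out)

# Intended use with the downloaded series matrix (file names as in the answer):
# library(GEOquery)
# strip_carriage_returns("GSE1729_series_matrix.txt.gz", "GSE1729_series_matrix.txt")
# gset <- getGEO(filename = "GSE1729_series_matrix.txt", GSEMatrix = TRUE)
# all(grepl("^GSM", sampleNames(gset)))  # GEO sample accessions start with "GSM"
```

Working on raw bytes rather than with readLines() is deliberate: readLines() would split the line at the stray carriage return instead of removing it.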