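Problem description
I am trying to load microarray data from GEO with GEOquery for analysis. When I use the code below, the row containing the sample names is dropped, and a data row of expression values is used as the header instead. Could you help me solve this? Thanks.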
library(GEOquery)
gset <- getGEO("GSE1729", GSEMatrix = TRUE)
if (length(gset) > 1) idx <- grep("GPL96",attr(gset,"names")) else idx <- 1
gset <- gset[[idx]]
gset
Output:
Parsed with column specification:
cols(
  .default = col_double(),
  **`1007_s_at` = col_character()**
)
See spec(...) for full column specifications.
|=================================================================================| 100% 4 MB
Warning: 68 parsing failures.
row col expected actual file
22216 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
22217 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
22218 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
22219 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
22220 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
..... ....... .................. ......... ............
See problems(...) for more details.
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22282 features,43 samples
element names: exprs
protocolData: none
phenoData
**sampleNames: 71 55.4 ... 84.8 (43 total)**
varLabels: title geo_accession ... data_row_count (26 total)
varMetadata: labelDescription
featureData
featureNames: 1053_at 117_at ... AFFX-TrpnX-M_at (22282 total)
fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 15674361
Annotation: GPL96
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] GEOquery_2.56.0 Biobase_2.48.0 BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 tidyr_1.1.1 crayon_1.3.4 dplyr_1.0.1 R6_2.4.1
[6] lifecycle_0.2.0 magrittr_1.5 pillar_1.4.6 rlang_0.4.7 curl_4.3
[11] rstudioapi_0.11 limma_3.44.3 xml2_1.3.2 vctrs_0.3.2 generics_0.0.2
[16] ellipsis_0.3.1 tools_4.0.2 readr_1.3.1 glue_1.4.1 purrr_0.3.4
[21] hms_0.5.3 compiler_4.0.2 pkgconfig_2.0.3 tidyselect_1.1.0 tibble_3.0.3
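Solution
Line 8 of the file (ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1729/matrix/GSE1729_series_matrix.txt.gz) appears to be garbled: its second (quoted) field contains a control character, which prevents the file from being parsed correctly: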
!Series_title "Gene expression profile of acute myeloid leukemia"
!Series_geo_accession "GSE1729"
!Series_status "Public on Jan 26 2005"
!Series_submission_date "Sep 06 2004"
!Series_last_update_date "Aug 10 2018"
!Series_pubmed_id "15674361"
!Series_summary "Gene expression profile of acute myeloid leukemia."
!Series_summary "^M"
...
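To confirm which line of a decompressed series matrix file carries the stray carriage return (the character displayed as ^M in the excerpt above), one option is to scan the raw bytes. This is only a minimal sketch: `find_cr_lines` is a hypothetical helper name, not part of GEOquery, and it assumes the file otherwise uses plain LF line endings and has already been decompressed to `GSE1729_series_matrix.txt`.

```r
# Hypothetical helper (not part of GEOquery): report the line numbers
# that contain a carriage-return byte (0x0D, displayed as ^M).
find_cr_lines <- function(path) {
  bytes  <- readBin(path, what = "raw", n = file.size(path))
  cr_pos <- which(bytes == as.raw(0x0d))  # positions of carriage returns
  lf_pos <- which(bytes == as.raw(0x0a))  # positions of line feeds
  # Line number of each CR = 1 + number of line feeds that precede it
  vapply(cr_pos, function(p) 1L + sum(lf_pos < p), integer(1))
}

# Assumed local file name; only runs if the file is actually present
if (file.exists("GSE1729_series_matrix.txt")) {
  print(find_cr_lines("GSE1729_series_matrix.txt"))
}
```

Working on raw bytes rather than `readLines()` is deliberate here, because `readLines()` treats a lone carriage return as a line terminator and would silently split the affected line in two.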
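If you download and decompress the file (for example with 7-Zip on Windows), open it in an editor, delete that entry, and save it again, you can read the locally modified copy and get the correct sampleNames (there is no need to re-compress it).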
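If you would rather not edit the file by hand, the same clean-up can probably be scripted. The sketch below assumes the problem character is the carriage return shown as ^M and simply drops every such byte from the decompressed file; `strip_carriage_returns` is a hypothetical helper name, and by default it overwrites the file in place, so keep a backup if that matters.

```r
# Hypothetical helper (not part of GEOquery): remove every carriage-return
# byte (0x0D) from a file. Tabs and line feeds are left untouched.
strip_carriage_returns <- function(infile, outfile = infile) {
  # Read the whole file into memory first, so writing in place is safe
  bytes   <- readBin(infile, what = "raw", n = file.size(infile))
  cleaned <- bytes[bytes != as.raw(0x0d)]
  writeBin(cleaned, outfile)
  # Return (invisibly) how many bytes were removed
  invisible(length(bytes) - length(cleaned))
}

# Assumed local file name; only runs if the file is actually present
if (file.exists("GSE1729_series_matrix.txt")) {
  removed <- strip_carriage_returns("GSE1729_series_matrix.txt")
  message("Removed ", removed, " carriage-return byte(s)")
}
```

The cleaned file can then be read from disk in the usual way: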
gset <- getGEO(filename = "GSE1729_series_matrix.txt", GSEMatrix = TRUE, parseCharacteristics = TRUE)
## check:
sampleNames(gset)
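Instead of eyeballing the result, the check can be made explicit. GEO sample accessions have the form GSM followed by digits, so a quick pattern test should tell correctly parsed names apart from expression values such as the 71 and 55.4 reported in the question. `looks_like_gsm` is a hypothetical helper name, and the accession numbers in the example are made up for illustration.

```r
# Hypothetical helper: TRUE only if every name looks like a GEO sample
# accession ("GSM" followed by one or more digits).
looks_like_gsm <- function(x) {
  length(x) > 0 && all(grepl("^GSM[0-9]+$", x))
}

# Illustrative inputs (the GSM numbers are invented, not taken from GSE1729)
good_names <- c("GSM100001", "GSM100002")
bad_names  <- c("71", "55.4")

print(looks_like_gsm(good_names))
print(looks_like_gsm(bad_names))

# With a parsed ExpressionSet this would be called as:
# looks_like_gsm(sampleNames(gset))
```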