How to convert XML to data.frame when nodes have only attributes?

I am trying to use the XML package and either the xmlToList or the xmlToDataFrame function. My input data is on the internet (first two lines of code below), and I only need to work with a certain part of the XML (see the third command, which builds a node set).

    library(XML)

    url <- 'http://ClinicalTrials.gov/show/NCT00191100?resultsxml=true'
    xml <- xmlTreeParse(url, useInternalNodes = TRUE)
    ns <- getNodeSet(xml, '/clinical_study/clinical_results/reported_events/serious_events/category_list')

It is a list of categories, and inside the categories are "events". Each event has counts, and the counts are specific to the clinical trial arms (e.g., drug vs. placebo arm).

I only need the events. The clearest listing I could get is this one for cardio-respiratory arrest, using xmlToList:

    xl <- xmlToList(url)
    set2 <- xl$clinical_results$reported_events$serious_events$category_list
    set2[[3]]

    > set2[[3]]
    $title
    [1] "Cardiac disorders"

    $event_list
    $event_list$event
    $event_list$event$sub_title
    [1] "Cardio-respiratory arrest"

    $event_list$event$counts
             group_id            events subjects_affected  subjects_at_risk
                 "E1"               "1"               "1"             "260"

    $event_list$event$counts
             group_id            events subjects_affected  subjects_at_risk
                 "E2"               "0"               "0"             "255"

I am not able to use xmlToDataFrame because of the error below. (The node set holds all of its data in XML attributes, and I think xmlToDataFrame may not like this.)

    hopefulyDF <- getNodeSet(xml, '/clinical_study/clinical_results/reported_events/serious_events/category_list/category/event_list/event/counts')
    xmlToDataFrame(node = hopefulyDF)

    Error in matrix(vals, length(nfields), byrow = TRUE) :
      'data' must be of a vector type, was 'NULL'

How can I best extract the counts data? I tried unlist, but I am probably not advanced enough in R. I would like to avoid a loop and manual xmlGetAttr calls, but in the worst case any solution is acceptable. I find the XML package very dense, with two representations of XML data: as a list and as NodeSets... :-(

Ideal output would look like this (for all events, not just row 3):

    event                      group_ID  numerator  denominator
    Cardio-respiratory arrest  E1        1          260
    Cardio-respiratory arrest  E2        0          255

(or even have a category column (cardiac disorders) - that would be super-ideal)

P.S. I tried the answers to the questions "How to transform XML data into a data.frame?" and "R list to data frame", but with no luck. :-(

Best answer

You can simplify the XML extraction by iterating over each event and extracting the counts attributes via a relative XPath. By using rbindlist from the data.table package, you can deal with the missing attributes without adding in conditional code:

    library(XML)
    library(data.table)

    url <- 'http://ClinicalTrials.gov/show/NCT00191100?resultsxml=true'
    xml <- xmlTreeParse(url, useInternalNodes = TRUE)

    # Select every <event> node anywhere in the document
    ns <- getNodeSet(xml, '//event')

    # One small data.frame per event: the event text plus the attributes of its <counts> nodes.
    # fill = TRUE pads attributes that are missing for some events with NA.
    rbindlist(lapply(ns, function(x) {
      event <- xmlValue(x)
      data.frame(event, t(xpathSApply(x, ".//counts", xmlAttrs)))
    }), fill = TRUE)

    ##                              event group_id subjects_affected events subjects_at_risk
    ##   1: Total, serious adverse events       E1                44     NA               NA
    ##   2: Total, serious adverse events       E2                17     NA               NA
    ##   3:                       Anaemia       E1                 6      6              260
    ##   4:                       Anaemia       E2                 0      0              255
    ##   5:           Febrile neutropenia       E1                 6      6              260
    ##  ---
    ## 174:                         Cough       E2                15     16              255
    ## 175:                      Pruritus       E1                14     16              260
    ## 176:                      Pruritus       E2                 9      9              255
    ## 177:                  Hypertension       E1                19     19              260
    ## 178:                  Hypertension       E2                21     21              255
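To try the same idea without a network connection, here is a minimal sketch on a small inline document. The XML string is a hypothetical sample that only mimics the structure shown in the question; it is not the real ClinicalTrials.gov record. The first event deliberately lacks two attributes, so the effect of `fill = TRUE` is visible. The event name is read from `sub_title` rather than from the whole `event` node, which is an assumption made here to avoid picking up surrounding whitespace.

```r
library(XML)
library(data.table)

# Hypothetical sample (assumption): same nesting as in the question, made-up subset of the data.
txt <- '<clinical_study><clinical_results><reported_events><serious_events>
<category_list>
  <category>
    <title>Total</title>
    <event_list>
      <event>
        <sub_title>Total, serious adverse events</sub_title>
        <counts group_id="E1" subjects_affected="44"/>
        <counts group_id="E2" subjects_affected="17"/>
      </event>
    </event_list>
  </category>
  <category>
    <title>Cardiac disorders</title>
    <event_list>
      <event>
        <sub_title>Cardio-respiratory arrest</sub_title>
        <counts group_id="E1" events="1" subjects_affected="1" subjects_at_risk="260"/>
        <counts group_id="E2" events="0" subjects_affected="0" subjects_at_risk="255"/>
      </event>
    </event_list>
  </category>
</category_list>
</serious_events></reported_events></clinical_results></clinical_study>'

doc <- xmlParse(txt, asText = TRUE)
events <- getNodeSet(doc, "//event")

# One data.frame per event: the sub_title text plus one row per <counts> node,
# with the attributes of each <counts> node as columns.
rows <- lapply(events, function(ev) {
  attrs <- xpathSApply(ev, "./counts", xmlAttrs)   # matrix: attributes x counts nodes
  data.frame(event = xpathSApply(ev, "./sub_title", xmlValue),
             t(attrs),
             stringsAsFactors = FALSE)
})

# fill = TRUE inserts NA where an event has no "events" or "subjects_at_risk" attribute
res <- rbindlist(rows, fill = TRUE)
print(res)
```

Without `fill = TRUE`, rbindlist would stop with an error because the per-event tables do not all have the same columns.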

You can always convert it back to a data.frame and/or rename columns if needed.
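As a rough sketch of that last step, the snippet below renames the columns, converts the counts to integers, and returns a plain data.frame. It also adds the category column the question asked for by walking up from each event with the XPath `ancestor` axis. The inline XML is a hypothetical sample, and mapping `subjects_affected` to `numerator` and `subjects_at_risk` to `denominator` is an assumption; the question does not say which count it means by numerator.

```r
library(XML)
library(data.table)

# Hypothetical sample (assumption): one category with one event, as printed in the question.
txt <- '<category_list>
  <category>
    <title>Cardiac disorders</title>
    <event_list>
      <event>
        <sub_title>Cardio-respiratory arrest</sub_title>
        <counts group_id="E1" events="1" subjects_affected="1" subjects_at_risk="260"/>
        <counts group_id="E2" events="0" subjects_affected="0" subjects_at_risk="255"/>
      </event>
    </event_list>
  </category>
</category_list>'

doc <- xmlParse(txt, asText = TRUE)

rows <- lapply(getNodeSet(doc, "//event"), function(ev) {
  data.frame(
    # title of the enclosing <category>, found relative to the current event
    category = xpathSApply(ev, "ancestor::category/title", xmlValue),
    event    = xpathSApply(ev, "./sub_title", xmlValue),
    t(xpathSApply(ev, "./counts", xmlAttrs)),
    stringsAsFactors = FALSE
  )
})
dt <- rbindlist(rows, fill = TRUE)

# Rename to names close to the question's ideal output (the mapping is an assumption)
setnames(dt,
         old = c("group_id", "subjects_affected", "subjects_at_risk"),
         new = c("group_ID", "numerator", "denominator"))

# Attributes arrive as character strings; convert the counts to integers
num_cols <- c("events", "numerator", "denominator")
dt[, (num_cols) := lapply(.SD, as.integer), .SDcols = num_cols]

# Back to a plain data.frame
df <- as.data.frame(dt)
print(df)
```

If a plain data.frame is not required, the `as.data.frame` call can be dropped and the data.table used directly.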
