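PowerShell: Storing a website locally while preserving the object type

Problem description

I want to save a website offline as an object. I am using PowerShell 5.1.19041.546 on Windows 10.

# Online analysis (works)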

$website = Invoke-WebRequest https://www.w3schools.com/html/html_tables.asp
$website | gm

# I get a Microsoft.PowerShell.Commands.HtmlWebResponseObject object.
# Next, I pass $website to this function (I call it Get-WebRequestTable), which expects a [Microsoft.PowerShell.Commands.HtmlWebResponseObject] $WebRequest input object: https://www.leeholmes.com/blog/2015/01/05/extracting-tables-from-powershells-invoke-webrequest/

# Offline analysis: saving the website locally and importing it with Get-Content (does not work)

# Saving the website locally
Invoke-WebRequest -Uri https://www.w3schools.com/html/html_tables.asp -OutFile C:\temp\website
# Reading the website back into a variable
$offlinedata = Get-Content C:\temp\website
# I get String objects
$offlinedata | gm
# Strings cannot be used in the function: Get-WebRequestTable : Cannot process argument transformation on parameter 'WebRequest'. Cannot convert the "System.Object[]" value of type "System.Object[]" to type "Microsoft.PowerShell.Commands.HtmlWebResponseObject".
Get-WebRequestTable -WebRequest $offlinedata

# Offline analysis: saving the website as XML (does not work)

Invoke-WebRequest -Uri  https://www.w3schools.com/html/html_tables.asp  | Export-Clixml C:\temp\website.xml

This runs for a very long time, and I get the following XML (abbreviated):

<Objs Version="1.1.0.1" xmlns="http://schemas.microsoft.com/powershell/2004/04">
  [...]                  <S>System.__ComObject</S>
                         <S>System.__ComObject</S>

This seems to end up in an infinite loop of

 <S>System.__ComObject</S>

# Converting it to JSON in order to store it locally (does not work)

$website = Invoke-WebRequest -Uri  https://www.w3schools.com/html/html_tables.asp 
$website | ConvertTo-Json

I get:

ConvertTo-Json : An item with the same key has already been added.

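Does anyone know how to store a website locally and later restore the [Microsoft.PowerShell.Commands.HtmlWebResponseObject] object for further processing?

Solution

This code loads HTML from a local file into an [HTMLDocumentClass] COM object, which can be used in place of the ParsedHtml document of an HtmlWebResponseObject: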

function convert-localhtml($localhtmlpath){
    $HTML = New-Object -Com "HTMLFile"
    $website = Get-Content "$localhtmlpath" -raw -ErrorAction Stop
    # Write HTML content according to DOM Level2 
    $HTML.IHTMLDocument2_write($website)
    $HTML
}
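
The function above relies on the IHTMLDocument2_write method being exposed on the COM object. A commonly reported workaround for machines where that call fails is to fall back to the untyped write method with UTF-16 bytes. The sketch below shows that idea under those assumptions; the function name Convert-LocalHtmlWithFallback and its parameter name are hypothetical and not part of the original answer.

```powershell
# Hypothetical variant of convert-localhtml with a fallback write path.
function Convert-LocalHtmlWithFallback {
    param(
        [Parameter(Mandatory = $true)]
        [string]$LocalHtmlPath
    )

    # -Raw returns the file as a single string instead of one string per line.
    $source = Get-Content -LiteralPath $LocalHtmlPath -Raw -ErrorAction Stop

    # Windows-only: the "HTMLFile" COM class is provided by MSHTML.
    $html = New-Object -ComObject 'HTMLFile'

    try {
        # Typed interface method, as used in the original answer.
        $html.IHTMLDocument2_write($source)
    }
    catch {
        # Assumption: where the typed method is unavailable, write()
        # accepts the document as UTF-16 (Unicode) bytes.
        $bytes = [System.Text.Encoding]::Unicode.GetBytes($source)
        $html.write($bytes)
    }

    # Return the populated document object to the caller.
    $html
}
```

It is called the same way as convert-localhtml, for example Convert-LocalHtmlWithFallback -LocalHtmlPath C:\temp\website, and the returned document should expose getElementsByTagName like the original.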

Credit to Prateek Singh: https://ridicurious.com/2017/01/24/powershell-tip-parsing-html-from-a-local-file-or-a-string/

I slightly modified Lee Holmes's code so that it can handle both object types: [Microsoft.PowerShell.Commands.HtmlWebResponseObject] if you use Invoke-WebRequest, and [HTMLDocumentClass] if you use convert-localhtml.

https://www.leeholmes.com/blog/2015/01/05/extracting-tables-from-powershells-invoke-webrequest/

Credit to him for his excellent table extraction code.

function Get-WebRequestTable {
    param(
        [Parameter(Mandatory = $true)]
        $WebRequest,

        [Parameter(Mandatory = $true)]
        [int]$TableNumber
    )

    ## Ensure that a supported type was passed
    $typeName = $WebRequest.GetType().Name
    if (($typeName -ne "HTMLDocumentClass") -and ($typeName -ne "HtmlWebResponseObject")) {
        throw "Unsupported argument type. Need [Microsoft.PowerShell.Commands.HtmlWebResponseObject] or [HTMLDocumentClass]."
    }

    ## Extract the tables out of the web request or the HTML document
    if ($WebRequest -is [Microsoft.PowerShell.Commands.HtmlWebResponseObject]) {
        $tables = @($WebRequest.ParsedHtml.getElementsByTagName("TABLE"))
    }
    else {
        ## An [HTMLDocumentClass] argument was given
        $tables = @($WebRequest.getElementsByTagName("TABLE"))
    }

    $table = $tables[$TableNumber]
    $titles = @()
    $rows = @($table.Rows)

    ## Go through all of the rows in the table
    foreach ($row in $rows)
    {
        $cells = @($row.Cells)

        ## If we've found a table header, remember its titles
        if ($cells[0].tagName -eq "TH")
        {
            $titles = @($cells | ForEach-Object { ("" + $_.InnerText).Trim() })
            continue
        }

        ## If we haven't found any table headers, make up names "P1", "P2", etc.
        if (-not $titles)
        {
            $titles = @(1..($cells.Count + 2) | ForEach-Object { "P$_" })
        }

        ## Now go through the cells in the row. For each, try to find the
        ## title that represents that column and create a hashtable mapping those
        ## titles to content
        $resultObject = [Ordered]@{}

        for ($counter = 0; $counter -lt $cells.Count; $counter++)
        {
            $title = $titles[$counter]
            if (-not $title) { continue }

            $resultObject[$title] = ("" + $cells[$counter].InnerText).Trim()
        }

        ## And finally cast that hashtable to a PSCustomObject
        [pscustomobject]$resultObject
    }
}
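
Putting the two functions together, the offline workflow might look like the sketch below. It assumes convert-localhtml and Get-WebRequestTable from this answer are already defined in the session, and it uses a small sample file in the temp folder (a hypothetical stand-in for the page saved earlier with -OutFile) so that no network access is needed.

```powershell
# Assumes convert-localhtml and Get-WebRequestTable (defined above) are loaded.

# Hypothetical sample page; in practice this would be the file written by
# Invoke-WebRequest -Uri <url> -OutFile <path>.
$samplePath = Join-Path ([System.IO.Path]::GetTempPath()) 'sample_table.html'
$sampleHtml = @'
<html><body>
<table>
  <tr><th>Company</th><th>Country</th></tr>
  <tr><td>Alfreds Futterkiste</td><td>Germany</td></tr>
</table>
</body></html>
'@
Set-Content -LiteralPath $samplePath -Value $sampleHtml

# Load the local file into an HTML document object.
$offlineDocument = convert-localhtml $samplePath

# Extract the first table (index 0); each data row becomes one PSCustomObject.
$rows = @(Get-WebRequestTable -WebRequest $offlineDocument -TableNumber 0)
$rows | Format-Table -AutoSize
```

The same Get-WebRequestTable call also accepts the result of Invoke-WebRequest directly, so online and offline sources can share the rest of a script.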
