Scraping the IdentityServer4 documentation with HtmlAgilityPack + PuppeteerSharp + iText7
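I need to learn how to use IdentityServer4, but the IdentityServer4 documentation site (reference [1]) does not seem to offer an offline download. So I decided to use HtmlAgilityPack + PuppeteerSharp + iText7 to scrape the site and build an offline PDF that is convenient to read and search locally.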
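The first step is to analyze the page structure. Inspecting the left-hand navigation menu in the HTML of the documentation home page shows the following:
1) The whole navigation menu sits inside a div element with the class wy-menu wy-menu-vertical;
2) Each first-level menu name is in a p element with the class caption;
3) The second-level entries of a first-level menu come right after that p element, inside a ul element, in the li elements with the class toctree-l1. The li elements with the class toctree-l2 hold the next level down, which is in-page navigation, and can be ignored.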
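The three observations above can be tried out on a small hand-written fragment before pointing the scraper at the real site. The sketch below assumes the HtmlAgilityPack NuGet package is referenced. The sample HTML, the page names, and the MenuDemo class are made up for illustration and only mimic the structure described; they are not copied from the site.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class MenuDemo
{
    // Hypothetical fragment that only mimics the menu structure described above.
    public const string SampleHtml = @"
<div class='wy-menu wy-menu-vertical'>
  <p class='caption'><span>Introduction</span></p>
  <ul>
    <li class='toctree-l1'><a href='intro/a.html'>Page A</a>
      <ul><li class='toctree-l2'><a href='intro/a.html#part'>In-page link</a></li></ul>
    </li>
    <li class='toctree-l1'><a href='intro/b.html'>Page B</a></li>
  </ul>
</div>";

    // Returns one "caption | title | href" line per toctree-l1 entry.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new List<string>();
        HtmlNode menu = doc.DocumentNode.SelectSingleNode("//div[@class='wy-menu wy-menu-vertical']");
        string caption = string.Empty;
        foreach (HtmlNode child in menu.ChildNodes)
        {
            // A p element carries the first-level menu name for the ul that follows it.
            if (child.Name == "p") caption = child.InnerText.Trim();
            if (child.Name != "ul") continue;
            // Direct li children only, so nested toctree-l2 items are never visited.
            HtmlNodeCollection links = child.SelectNodes("./li[@class='toctree-l1']/a[1]");
            if (links == null) continue; // SelectNodes returns null when nothing matches
            foreach (HtmlNode a in links)
            {
                string href = a.GetAttributeValue("href", string.Empty);
                result.Add(caption + " | " + a.InnerText.Trim() + " | " + href);
            }
        }
        return result;
    }

    static void Main()
    {
        foreach (string line in Extract(SampleHtml)) Console.WriteLine(line);
    }
}
```

Only the two toctree-l1 links should be listed; the nested toctree-l2 link is skipped because the XPath looks at direct li children of the ul.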
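Based on these observations, I modified the program I had written earlier for scraping the SqlSugar documentation. The main code and the program's output are shown below: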
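    // web is an HtmlAgilityPack.HtmlWeb instance; txtUrl holds the URL of the documentation home page.
    HtmlAgilityPack.HtmlDocument docu = web.Load(txtUrl.Text);
    HtmlNode node = docu.DocumentNode.SelectSingleNode(@"//div[@class='wy-menu wy-menu-vertical']");
    // Resolve relative hrefs against the documentation root instead of concatenating strings.
    Uri baseUri = new Uri("https://identityserver4.readthedocs.io/en/latest/");
    string curClass = string.Empty;
    foreach (HtmlNode subNode in node.ChildNodes)
    {
        string className = subNode.GetAttributeValue("class", string.Empty);
        if ((subNode.Name == "p") && (className == "caption"))
        {
            // A caption paragraph holds the first-level menu name.
            curClass = HtmlEntity.DeEntitize(subNode.InnerText).Trim();
        }
        else if (subNode.Name == "ul")
        {
            // Only toctree-l1 items are pages; SelectNodes returns null when nothing matches.
            HtmlNodeCollection tmpNode = subNode.SelectNodes(".//li[@class='toctree-l1']/a[1]");
            if (tmpNode == null) continue;
            foreach (HtmlNode n in tmpNode)
            {
                string href = n.GetAttributeValue("href", string.Empty);
                m_urls.Add(new LinkInfo { Module = curClass, Name = HtmlEntity.DeEntitize(n.InnerText).Trim(), Url = new Uri(baseUri, href).AbsoluteUri });
                ...
                ...
            }
        }
    }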
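Building the absolute URL by joining a base string and an href is sensitive to hrefs that start with "./" or "../". As a minimal sketch using only System.Uri from the base class library, relative links can instead be resolved against the page they were found on. The LinkResolver class and the example paths are illustrative assumptions, not part of the original program.

```csharp
using System;

static class LinkResolver
{
    // Resolves an href taken from the menu against the URL of the page it was found on.
    public static string Resolve(string pageUrl, string href)
    {
        return new Uri(new Uri(pageUrl), href).AbsoluteUri;
    }
}

class Program
{
    static void Main()
    {
        const string home = "https://identityserver4.readthedocs.io/en/latest/index.html";
        // Both forms of the same relative link resolve to the same absolute URL.
        Console.WriteLine(LinkResolver.Resolve(home, "intro/big_picture.html"));
        Console.WriteLine(LinkResolver.Resolve(home, "./intro/big_picture.html"));
    }
}
```

Both calls print https://identityserver4.readthedocs.io/en/latest/intro/big_picture.html, because the Uri class drops the file part of the base and normalizes the "./" segment.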
Next comes the code that generates one PDF file per page, along with its output:
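    var options = new LaunchOptions { Headless = true };
    // Download a compatible browser build on first run.
    using var browserFetcher = new BrowserFetcher();
    await browserFetcher.DownloadAsync();
    await using var browser = await Puppeteer.LaunchAsync(options);
    // Make sure the output folder exists before writing into it.
    string outputDir = Path.Combine(Directory.GetCurrentDirectory(), "papers");
    Directory.CreateDirectory(outputDir);
    foreach (LinkInfo url in m_urls)
    {
        // "await using" disposes the page at the end of each iteration, so no explicit DisposeAsync call is needed.
        await using var page = await browser.NewPageAsync();
        await page.GoToAsync(url.Url);
        PdfOptions option = new PdfOptions
        {
            Format = PuppeteerSharp.Media.PaperFormat.A4,
            Landscape = true
        };
        // File name pattern: <Module>_<Name>.pdf; '/' is replaced because it is not valid in a file name.
        string fileName = $"{url.Module}_{url.Name}.pdf".Replace('/', '_');
        await page.PdfAsync(Path.Combine(outputDir, fileName), option);
    }
    MessageBox.Show("Finished generating PDF files!");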
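The file name above is built from the module and page titles, and only '/' is replaced. Titles can contain other characters that a file system rejects, and an underscore inside a title would blur the module/title separator. The following is a minimal sketch of a stricter name builder using only the base class library. PdfFileNames and its replacement character '-' are assumptions made for illustration, not part of the original program.

```csharp
using System;
using System.IO;
using System.Linq;

static class PdfFileNames
{
    // Builds "<module>_<name>.pdf", replacing characters that are invalid in file names.
    // Underscores inside either part are also replaced so that the single '_' separator
    // stays unambiguous (an illustrative design choice).
    public static string Build(string module, string name)
    {
        return Clean(module) + "_" + Clean(name) + ".pdf";
    }

    private static string Clean(string text)
    {
        char[] invalid = Path.GetInvalidFileNameChars();
        char[] cleaned = text.Trim()
            .Select(c => (c == '_' || Array.IndexOf(invalid, c) >= 0) ? '-' : c)
            .ToArray();
        return new string(cleaned);
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(PdfFileNames.Build("Endpoints", "Discovery/Token"));
        Console.WriteLine(PdfFileNames.Build("A_B", " C "));
    }
}
```

The set returned by Path.GetInvalidFileNameChars differs between operating systems, so titles containing characters such as ':' or '?' may produce different names on Windows and Linux.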
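Finally, here is the code that calls iText7 to merge all the PDF files into a single IdentityServer4 help document with bookmarks, along with its output. The generated document has been uploaded to my CSDN blog resources; feel free to download it if you need it.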
    PdfDocument pdfDoc = new PdfDocument(new PdfWriter(txtFileName.Text));
    PdfMerger merger = new PdfMerger(pdfDoc);
    // Keep the source documents open; their page counts are needed again when building the bookmarks.
    merger.SetCloseSourceDocuments(false);
    List<PdfFileInfo> pdfFiles = GetSourceDocuments();
    foreach (PdfFileInfo doc in pdfFiles)
    {
        merger.Merge(doc.docu, 1, doc.docu.GetNumberOfPages());
    }
    // Build a two-level outline: module -> page. Files are expected to be grouped by module.
    PdfOutline rootOutline = pdfDoc.GetOutlines(false);
    PdfOutline tmpOutline = null;
    int curPageIndex = 1;
    string tmpModule = null;
    foreach (PdfFileInfo doc in pdfFiles)
    {
        // File names follow the <Module>_<Name> pattern; split at the first underscore.
        string fileName = doc.FileName;
        int underlineIndex = fileName.IndexOf('_');
        string module = fileName.Substring(0, underlineIndex);
        // Compare whole module names, so a module whose name is a prefix of the next one still gets its own bookmark.
        if (module != tmpModule)
        {
            tmpModule = module;
            tmpOutline = rootOutline.AddOutline(tmpModule);
            tmpOutline.AddDestination(PdfExplicitDestination.CreateFit(pdfDoc.GetPage(curPageIndex)));
        }
        PdfOutline tmpSubOutline = tmpOutline.AddOutline(fileName.Substring(underlineIndex + 1));
        tmpSubOutline.AddDestination(PdfExplicitDestination.CreateFit(pdfDoc.GetPage(curPageIndex)));
        curPageIndex += doc.docu.GetNumberOfPages();
    }
    pdfDoc.Close();
    // Close the source documents only after the merged document has been written.
    foreach (PdfFileInfo doc in pdfFiles)
    {
        doc.docu.Close();
    }
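The bookmark logic comes down to two things: splitting each file name at the first underscore, and keeping a running 1-based page index into the merged document. The sketch below isolates that arithmetic with plain tuples so it can be checked without iText7. BookmarkPlan, the sample file names, and the page counts are hypothetical and only illustrate the bookkeeping.

```csharp
using System;
using System.Collections.Generic;

static class BookmarkPlan
{
    // For each source file, returns its module, its title, and the 1-based page
    // in the merged PDF where that file starts.
    public static List<(string Module, string Title, int StartPage)> Build(
        IEnumerable<(string FileName, int Pages)> files)
    {
        var plan = new List<(string Module, string Title, int StartPage)>();
        int page = 1;
        foreach (var (fileName, pages) in files)
        {
            int i = fileName.IndexOf('_');
            if (i < 0) throw new ArgumentException("Expected <Module>_<Name>: " + fileName);
            plan.Add((fileName.Substring(0, i), fileName.Substring(i + 1), page));
            page += pages; // the next file starts right after this one
        }
        return plan;
    }
}

class Program
{
    static void Main()
    {
        var files = new List<(string FileName, int Pages)>
        {
            ("Intro_Big Picture", 3),
            ("Intro_Terminology", 2),
            ("Topics_Startup", 4),
        };
        foreach (var entry in BookmarkPlan.Build(files))
            Console.WriteLine(entry.Module + " / " + entry.Title + " -> page " + entry.StartPage);
    }
}
```

With these sample page counts the three files start at pages 1, 4 and 6, which is the same running total that curPageIndex tracks in the merge code.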
References:
[1] https://identityserver4.readthedocs.io/en/latest/index.html
[2] https://blog.csdn.net/Gltu_java/article/details/142656171
Original post: https://blog.csdn.net/gc_2299/article/details/143657341