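Using regular expressions in C# to extract all article links from the page http://arxiv.org/list/astro-ph/new, save them in an array and store them in a database, then download and save the linked PDFs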

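Question 1: I have already saved the page source into a string and want to use a regular expression to pull out the PDF links. The relevant part of the page source looks like this:
[<a href="/pdf/1305.3603" title="Download PDF">pdf</a>
I want to extract /pdf/1305.3603 from it with a regular expression and store it in an array.
Question 2: Each extracted link points to a PDF served on the web. I want to download that PDF and save it to a local folder. Screenshot of the PDF link opened from the page source: [image not available]
Additional note: open http://arxiv.org/list/astro-ph/new and right-click to view the source; the PDF link of the first article is on line 56 of the source.
Thanks!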
regular expressions, C#, PDF, links, source code
-------------------- Programming Q&A --------------------



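// Namespaces used: System, System.IO, System.Linq, System.Text, System.Text.RegularExpressions
string txt = File.ReadAllText("1.txt", Encoding.Default);

// Capture each "/pdf/..." href inside a list-identifier span and build an absolute PDF URL from it.
var arr = Regex.Matches(txt, @"(?is)<span class=""list-identifier"">\s*.*?""(/pdf/.*?)"".*?</span>").OfType<Match>().Select(x => "http://arxiv.org" + x.Groups[1] + ".pdf").ToList();

Console.WriteLine("=====Count:{0}=======", arr.Count);
foreach (var item in arr)
{
    Console.WriteLine(item);
}

Console.Read();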
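
Question 2 (saving each PDF to a local folder) is not answered in the thread. The following is only a minimal sketch of one way to do it, assuming the URLs in arr are directly downloadable. The class name PdfDownloader, its method names, and the one-second pause between requests are hypothetical choices for illustration, not something taken from the posts above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

static class PdfDownloader
{
    // Derives a local file path from the last URL segment,
    // e.g. "http://arxiv.org/pdf/1305.3603.pdf" -> "<folder>/1305.3603.pdf".
    public static string LocalPathFor(string url, string folder)
    {
        string name = Path.GetFileName(new Uri(url).AbsolutePath);
        return Path.Combine(folder, name);
    }

    // Downloads every URL in turn and writes the bytes to the target folder.
    public static async Task DownloadAllAsync(IEnumerable<string> urls, string folder)
    {
        Directory.CreateDirectory(folder); // does nothing if the folder already exists
        using (var client = new HttpClient())
        {
            foreach (string url in urls)
            {
                byte[] data = await client.GetByteArrayAsync(url);
                File.WriteAllBytes(LocalPathFor(url, folder), data);
                await Task.Delay(1000); // arbitrary pause so the server is not hit in a tight loop
            }
        }
    }
}
```

From a console program this could be called as PdfDownloader.DownloadAllAsync(arr, "pdfs").Wait(); where "pdfs" is a placeholder folder name.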

-------------------- Programming Q&A --------------------

Re: reply #1 by nice_fish:




Below is the code I wrote earlier to extract the other fields. Could the link-extraction code be added to it? Thanks!
        string tempStr = File.ReadAllText(@"C:\Users\myx\Desktop\Test.txt", Encoding.GetEncoding("GB2312")); // read the saved page source
        string pattern = @"(?i)<span[^>]*?class=(['""]?)list-identifier\1[^>]*?><a[^>]*?>[^<>\d]*?(?<ID>[\d\.]+)\s*?</a>";
        pattern += @"[\s\S]*?<div[^>]*?class=(['""]?)list-title\2[^>]*?>[\s\S]*?<span[^>]*?>[\s\S]*?</span>(?<Title>[\s\S]*?)\s*?</div>";
        pattern += @"[\s\S]*?<div[^>]*?class=(['""]?)list-authors\3[^>]*?>[\s\S]*?<span[^>]*?>[\s\S]*?</span>(?:\s*?<a[^>]*?>(?<Authors>[^<>]*?)</a>[^<>]*)+\s*?</div>";
        pattern += @"[\s\S]*?<div[^>]*?class=(['""]?)list-subjects\4[^>]*?>[\s\S]*?<span[^>]*?class=(['""]?)primary-subject\5[^>]*?>(?<Subject>[\s\S]*?)</span>[\s\S]*?</div>";
        pattern += @"[\s\S]*?<p>\s*?(?<Content>[\s\S]*?)\s*?</p>";
        foreach (Match m in Regex.Matches(tempStr, pattern))
        {
            // Read the named groups of each match in turn
            string ID = m.Groups["ID"].Value; // e.g. 1305.0262
            string Title = m.Groups["Title"].Value; // e.g. On-sky characterisation of the VISTA NB118 narrow-band filters at 1.19  micron
            string Authors = string.Join("|", m.Groups["Authors"].Captures.Cast<Capture>().Select(a => a.Value));
            /* Authors are separated by |, e.g.
             * B. Milvang-Jensen|W. Freudling|J. Zabl|J. P. U. Fynbo|P. Moller|K. K. Nilsson|H. Joy McCracken|J. Hjorth|O. Le Fevre|L. Tasca|J. S. Dunlop|D. Sobral
             */
            string Subject = m.Groups["Subject"].Value; // e.g. Instrumentation and Methods for Astrophysics (astro-ph.IM)
            string Content = m.Groups["Content"].Value; // text of the <p> element that follows
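            // Sketch (assumption): the PDF href seems to be "/pdf/" followed by the same arXiv ID,
            // as in the /pdf/1305.3603 link from the question, so the URL may be buildable from ID
            // without a second regex. PdfUrl is a hypothetical variable name; ".pdf" follows reply #1.
            string PdfUrl = "http://arxiv.org/pdf/" + ID + ".pdf";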
        }
-------------------- Programming Q&A --------------------
Isn't the code already written? As for how to add it, that depends on the processing logic you want.
-------------------- Programming Q&A --------------------

Re: reply #3 by nice_fish:

In the code @"(?is)<span class=""list-identifier"">\s*.*?""(/pdf/.*?)"".*?</span>").OfType<Match>().Select(x => "http://arxiv.org" + x.Groups[1]+".pdf").ToList(); what does the part .OfType<Match>().Select(x => "http://arxiv.org" + x.Groups[1]+".pdf").ToList() mean?
-------------------- Programming Q&A --------------------

Re: reply #4 by u010509224:


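Here is what that part means:
Regex.Matches returns a MatchCollection. OfType<Match>() turns it into a typed sequence by keeping the elements that are of type Match. Select(...) then maps each match to a string, here "http://arxiv.org" plus the text captured by group 1 plus ".pdf". ToList() collects the results into a List<string>.

These calls are extension methods chained together.
If you are interested, it is worth spending a little time reading: http://www.cnblogs.com/lifepoem/archive/2011/12/16/2288017.html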
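
To make the chain more concrete, here is a rough sketch that puts the chained form next to an explicit loop that should produce the same list. The class name LinqChainDemo and the method names WithLinq and WithLoop are made up for this illustration; the pattern is the one from reply #1, and .Value is written out where the original relied on implicit string conversion.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class LinqChainDemo
{
    // Pattern copied from reply #1: group 1 captures the "/pdf/..." href.
    const string Pattern = @"(?is)<span class=""list-identifier"">\s*.*?""(/pdf/.*?)"".*?</span>";

    // The chained form: type the matches, project each one to a URL, collect into a list.
    public static List<string> WithLinq(string html)
    {
        return Regex.Matches(html, Pattern)
            .OfType<Match>()
            .Select(x => "http://arxiv.org" + x.Groups[1].Value + ".pdf")
            .ToList();
    }

    // The same steps written as an ordinary loop.
    public static List<string> WithLoop(string html)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            result.Add("http://arxiv.org" + m.Groups[1].Value + ".pdf");
        }
        return result;
    }
}
```

Both methods take the page source as a string, so either could replace the one-line expression in reply #1.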
-------------------- Programming Q&A --------------------

Re: reply #1 by nice_fish:




OfType<Match>().Select(x => "http://arxiv.org" + x.Groups[1]+".pdf").ToList()
Which namespace do these methods come from?
-------------------- Programming Q&A --------------------

Re: reply #6 by u010509224:


using System.Linq; // OfType, Select and ToList are extension methods defined in System.Linq.Enumerable

Additional info: .NET technology, C#