
A Detailed Tutorial on Making Kindle Ebooks with GitBook (Visual Edition)

Leave a comment on this article

Note: Fields marked with * are required. The email address you provide will be kept confidential. First-time comments will appear after manual review. If you are asking a question, please provide as much information as possible; this helps others better understand what you are asking.

Readers have left 17 comments

  1. Hello! Happy New Year!
    Thank you for the content on your site. I'd like to ask: how can I quickly batch-import a multi-level HTML website to make an ebook?

    I have a site whose content is all HTML: http://www.lingshh.com/
    But it is organized into several directory levels.
    After downloading the whole site, what is the fastest tool for quickly batch-producing an ebook that preserves the parent-child directory structure?

    Thanks!

    • Thank you! That site only has two or three levels, but I tried this code and the scrape didn't succeed…

      #!/usr/bin/env python
      # -*- coding:utf-8 -*-
      
      from calibre.web.feeds.recipes import BasicNewsRecipe # import the base Recipe class
      
      class Wang_Yin_Blog(BasicNewsRecipe): # new class inheriting from BasicNewsRecipe
      
          #///////////////////
          # Ebook metadata
          #///////////////////
          title = '吴越佛教' # ebook title
          description = u'吴越佛教' # ebook description
          #cover_url = '' # ebook cover
          #masthead_url = '' # masthead image
          __author__ = '杭州佛学院' # author
          language = 'zh' # language
          encoding = 'utf-8' # encoding
      
          #///////////////////
          # Page-fetching settings
          #///////////////////
          #keep_only_tags = [{ 'class': 'example' }] # keep only content matched by this selector
          no_stylesheets = True # strip CSS styles
          remove_javascript = True # strip JavaScript scripts
          auto_cleanup = True # automatically clean up the HTML
          delay = 5 # seconds to wait between page fetches
          max_articles_per_feed = 1999 # maximum number of articles to fetch
      
          #///////////////////
          # Index-parsing method
          #///////////////////
          def parse_index(self):
              site = 'http://www.lingshh.com/' # the article list page
              soup = self.index_to_soup(site) # parse the list page into a BeautifulSoup object
              links = soup.findAll("li", {"class": "list-group-item title"}) # get all article links
              articles = [] # empty list for article entries
              for link in links: # loop over all article links
                  title = link.a.contents[0].strip() # extract the article title
                  url = site + link.a.get("href") # extract the article URL
                  a = {'title': title, 'url': url} # combine title and URL
                  articles.append(a) # append to the list
              ans = [(self.title, articles)] # assemble the final data structure
              return ans # return a structure Calibre can convert
      • That article only provided an example; it can't be applied as-is, because every website's HTML structure is different. The script has to be written for the specific structure, which requires knowing a bit of Python.


        I took a look at the site you provided. Its content types are rather mixed: some pages are HTML and some are DOC files. DOC files can't be scraped directly unless you handle them separately in code. The code below fetches all the HTML pages but ignores the DOC files:

        #!/usr/bin/env python
        # -*- coding:utf-8 -*-
        
        import re
        from urlparse import urljoin, urlsplit
        from bs4 import BeautifulSoup
        
        from calibre.web.feeds.recipes import BasicNewsRecipe
        
        class ZhenRuShi(BasicNewsRecipe):
            title = '真如是'
            description = u'真如是'
            #cover_url = ''
            #masthead_url = ''
            __author__ = '真如是'
            language = 'zh'
            # encoding = 'gb2312'
        
            no_stylesheets = True
            remove_javascript = True
        
            # delay = 3
            timeout = 60
        
        
            def parse_index(self):
                ans = []
                site = 'http://www.lingshh.com'
                soup = self.pageListSoup(site)
                topics = soup.find_all('a', href=re.compile(r'.*\.htm$'))
                for topic in topics:
                    topic_title = topic.get_text().strip()
                    topic_url = urljoin(site, topic.get('href'))
                    soup = self.pageListSoup(topic_url)
                    links = soup.find_all('a', href=re.compile(r'.*\.htm$'))
                    articles = []
                    for link in links:
                        title = link.get_text().strip()
                        url = urljoin(topic_url, urlsplit(link.get('href')).path)
                        a = {'title': title , 'url':url}
                        articles.append(a)
                    if articles:
                        ans.append((topic_title, articles))
                return ans
        
        
            def pageListSoup(self, url):
                try:
                    result = self.index_to_soup(url, raw=True)
                except Exception as e:
                    return self.log.error('Fetch article failed(%s):%s' % (e, url))
                return BeautifulSoup(result, 'lxml')

        Of course the output of this script is still fairly rough; for finer control over the content you'll need to add Python code and refine it step by step.
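        As a rough sketch of that kind of refinement (the HTML sample, the `#content` selector, and the helper `clean_page` below are made-up illustrations, not taken from the actual site), one might post-process a fetched page with BeautifulSoup, keeping only the main content block and stripping script and style tags:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a fetched page; in a recipe the HTML would come
# from index_to_soup() or the article download step.
html = """
<html><body>
  <div id="nav">Home | About</div>
  <div id="content"><h1>Article</h1><script>track();</script><p>Body text.</p></div>
</body></html>
"""

def clean_page(raw_html, keep_selector):
    """Keep only the element matching keep_selector; drop script/style tags."""
    soup = BeautifulSoup(raw_html, 'html.parser')
    content = soup.select_one(keep_selector)
    if content is None:
        return ''
    for tag in content.find_all(['script', 'style']):
        tag.decompose()  # remove the tag and everything inside it
    return str(content)

print(clean_page(html, '#content'))
```

        In a recipe the same effect is usually achieved declaratively with `keep_only_tags` and `remove_tags`, but hand-written cleanup like this gives finer control when the declarative filters are not enough.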

        • Thank you!
          HTML alone is fine; I don't need the DOC files :)
          But when I tried running it, it still failed:

          Fetch news from 真如是

          Conversion options changed from defaults:
          verbose: 2
          Resolved conversion options
          calibre version: 3.42.0
          {'asciiize': False,
          'author_sort': None,
          'authors': None,
          'base_font_size': 0,
          'book_producer': None,
          'change_justification': 'original',
          'chapter': None,
          'chapter_mark': 'pagebreak',
          'comments': None,
          'cover': None,
          'debug_pipeline': None,
          'dehyphenate': True,
          'delete_blank_paragraphs': True,
          'disable_font_rescaling': False,
          'dont_download_recipe': False,
          'dont_split_on_page_breaks': True,
          'duplicate_links_in_toc': False,
          'embed_all_fonts': False,
          'embed_font_family': None,
          'enable_heuristics': False,
          'epub_flatten': False,
          'epub_inline_toc': False,
          'epub_toc_at_end': False,
          'epub_version': '2',
          'expand_css': False,
          'extra_css': None,
          'extract_to': None,
          'filter_css': None,
          'fix_indents': True,
          'flow_size': 260,
          'font_size_mapping': None,
          'format_scene_breaks': True,
          'html_unwrap_factor': 0.4,
          'input_encoding': None,
          'input_profile': ,
          'insert_blank_line': False,
          'insert_blank_line_size': 0.5,
          'insert_metadata': False,
          'isbn': None,
          'italicize_common_cases': True,
          'keep_ligatures': False,
          'language': None,
          'level1_toc': None,
          'level2_toc': None,
          'level3_toc': None,
          'line_height': 0,
          'linearize_tables': False,
          'lrf': False,
          'margin_bottom': 5.0,
          'margin_left': 5.0,
          'margin_right': 5.0,
          'margin_top': 5.0,
          'markup_chapter_headings': True,
          'max_toc_links': 50,
          'minimum_line_height': 120.0,
          'no_chapters_in_toc': False,
          'no_default_epub_cover': False,
          'no_inline_navbars': False,
          'no_svg_cover': False,
          'output_profile': ,
          'page_breaks_before': None,
          'prefer_metadata_cover': False,
          'preserve_cover_aspect_ratio': False,
          'pretty_print': True,
          'pubdate': None,
          'publisher': None,
          'rating': None,
          'read_metadata_from_opf': None,
          'remove_fake_margins': True,
          'remove_first_image': False,
          'remove_paragraph_spacing': False,
          'remove_paragraph_spacing_indent_size': 1.5,
          'renumber_headings': True,
          'replace_scene_breaks': '',
          'search_replace': None,
          'series': None,
          'series_index': None,
          'smarten_punctuation': False,
          'sr1_replace': '',
          'sr1_search': '',
          'sr2_replace': '',
          'sr2_search': '',
          'sr3_replace': '',
          'sr3_search': '',
          'start_reading_at': None,
          'subset_embedded_fonts': False,
          'tags': None,
          'test': False,
          'timestamp': None,
          'title': None,
          'title_sort': None,
          'toc_filter': None,
          'toc_threshold': 6,
          'toc_title': None,
          'transform_css_rules': None,
          'unsmarten_punctuation': False,
          'unwrap_lines': True,
          'use_auto_toc': False,
          'verbose': 2}
          InputFormatPlugin: Recipe Input running
          Downloading recipe urn: custom:1000
          Python function terminated unexpectedly
          No articles found, aborting (Error Code: 1)
          Traceback (most recent call last):
          File "site.py", line 101, in main
          File "site.py", line 78, in run_entry_point
          File "site-packages\calibre\utils\ipc\worker.py", line 200, in main
          File "site-packages\calibre\gui2\convert\gui_conversion.py", line 35, in gui_convert_recipe
          File "site-packages\calibre\gui2\convert\gui_conversion.py", line 27, in gui_convert
          File "site-packages\calibre\ebooks\conversion\plumber.py", line 1107, in run
          File "site-packages\calibre\customize\conversion.py", line 245, in __call__
          File "site-packages\calibre\ebooks\conversion\plugins\recipe_input.py", line 137, in convert
          File "site-packages\calibre\web\feeds\news.py", line 1024, in download
          File "site-packages\calibre\web\feeds\news.py", line 1203, in build_index
          ValueError: No articles found, aborting

            • I've upgraded to the latest version, 4.9.1, but fetching news still reports an error:

              calibre, version 4.9.1 (win32, embedded-python: True)
              Conversion error: Failed: Fetch news from 真如是

              Fetch news from 真如是
              Python function terminated unexpectedly
              (Error Code: 1)
              Traceback (most recent call last):
              File "site.py", line 114, in main
              File "site.py", line 88, in run_entry_point
              File "site-packages\calibre\utils\ipc\worker.py", line 209, in main
              File "site-packages\calibre\gui2\convert\gui_conversion.py", line 36, in gui_convert_recipe
              File "site-packages\calibre\gui2\convert\gui_conversion.py", line 28, in gui_convert
              File "site-packages\calibre\ebooks\conversion\plumber.py", line 1049, in run
              File "site-packages\calibre\ebooks\conversion\plumber.py", line 993, in setup_options
              File "site-packages\calibre\ebooks\conversion\plumber.py", line 947, in read_user_metadata
              File "site-packages\calibre\ebooks\metadata\__init__.py", line 357, in MetaInformation
              File "site-packages\calibre\ebooks\metadata\book\base.py", line 100, in __init__
              File "site-packages\calibre\ebooks\metadata\book\formatter.py", line 10, in
              File "site-packages\calibre\utils\formatter.py", line 16, in
              MemoryError

              • The MemoryError is probably caused by running out of memory because the site has so much content; I suggest fetching it in batches. Below is the modified code. The variable feeds is a manually specified list of three article indexes; you can add more entries in the same format:

                #!/usr/bin/env python
                # -*- coding:utf-8 -*-
                
                import re
                from urlparse import urljoin, urlsplit
                from bs4 import BeautifulSoup
                
                from calibre.web.feeds.recipes import BasicNewsRecipe
                
                class ZhenRuShi(BasicNewsRecipe):
                    title = '真如是'
                    description = u'真如是'
                    #cover_url = ''
                    #masthead_url = ''
                    __author__ = '真如是'
                    language = 'zh'
                    # encoding = 'gb2312'
                
                    no_stylesheets = True
                    remove_javascript = True
                
                    # delay = 3
                    timeout = 60
                
                    feeds = [
                        ('蔡日新文集', 'http://www.lingshh.com/cairx/mulu.htm'),
                        ('灵山寺志', 'http://www.lingshh.com/lingshansizhi/mulu.htm'),
                        ('乾明寺', 'http://www.lingshh.com/qianmingsi/mulu.htm'),
                        # Add other feeds
                    ]
                
                
                    def parse_index(self):
                        ans = []
                        for feed in self.feeds:
                            topic_url = feed[1]
                            soup = self.pageListSoup(topic_url)
                            links = soup.find_all('a', href=re.compile(r'.*\.htm$'))
                            articles = []
                            for link in links:
                                title = link.get_text().strip()
                                url = urljoin(topic_url, urlsplit(link.get('href')).path)
                                a = {'title': title , 'url':url}
                                articles.append(a)
                            if articles:
                                ans.append((feed[0], articles))
                        return ans
                
                
                    def pageListSoup(self, url):
                        try:
                            result = self.index_to_soup(url, raw=True)
                        except Exception as e:
                            return self.log.error('Fetch article failed(%s):%s' % (e, url))
                        return BeautifulSoup(result, 'html5lib')
                • It worked! Wonderful! Thank you so much :)

                  However, I put those entries together by hand…

                  How can I get the software to automatically recognize the link names + corresponding URLs on a page and produce this format?

                  ('蔡日新文集', 'http://www.lingshh.com/cairx/mulu.htm'),

                • They were specified manually. If you want to pull all the links out in one go, you can write a bit of code to extract them. Below is the extracted list; you can add entries from it directly:

                  ('刚晓文集', 'http://www.lingshh.com/gxwj/mulu.htm'),
                  ('蔡日新文集', 'http://www.lingshh.com/cairx/mulu.htm'),
                  ('灵山寺志', 'http://www.lingshh.com/lingshansizhi/mulu.htm'),
                  ('乾明寺', 'http://www.lingshh.com/qianmingsi/mulu.htm'),
                  ('灵山佛学研究会论文选第一集', 'http://www.lingshh.com/dalingshan/dlsmulu.htm'),
                  ('灵山佛学研究会论文选第二集', 'http://www.lingshh.com/dls2/dls2mulv.htm'),
                  ('灵山佛学研究会论文选第三集', 'http://www.lingshh.com/dls3/mulv.htm'),
                  ('第一届吴越佛教研讨会论文', 'http://www.lingshh.com/wuyuefojiao/mulu.htm'),
                  ('第二届吴越佛教研讨会论文', 'http://www.lingshh.com/2jiewyfj/mulu.htm'),
                  ('第三届吴越佛教研讨会论文', 'http://www.lingshh.com/3jiewyfj/mulu.htm'),
                  ('吴越佛教(第一卷)', 'http://www.lingshh.com/wyfjdyj/mulu.htm'),
                  ('吴越佛教(第二卷)', 'http://www.lingshh.com/wyfjdej/mulu.htm'),
                  ('吴越佛教(第三卷)', 'http://www.lingshh.com/wyfjdsj/mulu.htm'),
                  ('首届国际因明学术研讨会论文', 'http://www.lingshh.com/yinmingyth/mulu.htm'),
                  ('第六届吴越佛教唯识研讨会', 'http://www.lingshh.com/weishiyantaohui/mulu.htm'),
                  ('第九届吴越佛教研讨会论文', 'http://www.lingshh.com/wyfjdjj/mulu.htm'),
                  ('十一届吴越佛教', 'http://www.lingshh.com/11jiewyfj/mulu.htm'),
                  ('十三届吴越佛教', 'http://www.lingshh.com/13jwyfj/mulu.htm'),
                  ('十四届吴越佛教', 'http://www.lingshh.com/14jwyfj/14jwyfjmulu.htm'),
                  ('吴越佛教(第五卷)', 'http://www.lingshh.com/wyfjdwj/mulu.htm'),
                  ('因明(第一辑)', 'http://www.lingshh.com/yinming1/mulu.htm'),
                  ('因明(第二辑)', 'http://www.lingshh.com/yinming2/mulu.htm'),
                  ('因明(第三辑)', 'http://www.lingshh.com/yinming3/mulu.htm'),
                  ('第一期', 'http://www.lingshh.com/1-a/1-1.htm'),
                  ('第二期', 'http://www.lingshh.com/2-a/2-1.htm'),
                  ('第三期', 'http://www.lingshh.com/3-a/3-1.htm'),
                  ('第四期', 'http://www.lingshh.com/4-a/4-1.htm'),
                  ('第五期', 'http://www.lingshh.com/5-a/5-1.htm'),
                  ('第六期', 'http://www.lingshh.com/6-a/6_ml.htm'),
                  ('第七期', 'http://www.lingshh.com/7-a/mulu.htm'),
                  ('第八期', 'http://www.lingshh.com/8-a/8mulu.htm'),
                  ('第九期', 'http://www.lingshh.com/9-a/9mulu.htm'),
                  ('第十期', 'http://www.lingshh.com/10-a/10mulu.htm'),
                  ('十一期', 'http://www.lingshh.com/11-a/mulu.htm'),
                  ('十二期', 'http://www.lingshh.com/12-a/mulu.htm'),
                  ('十三期', 'http://www.lingshh.com/13-a/mulu.htm'),
                  ('十四期', 'http://www.lingshh.com/14-a/mulu.htm'),
                  ('十五期', 'http://www.lingshh.com/15-a/mulu.htm'),
                  ('十六期', 'http://www.lingshh.com/16-a/mulu.htm'),
                  ('十七期', 'http://www.lingshh.com/17-a/mulu.htm'),
                  ('十八期', 'http://www.lingshh.com/18-a/mulu.htm'),
                  ('十九期', 'http://www.lingshh.com/19-a/mulu.htm'),
                  ('二十期', 'http://www.lingshh.com/20-a/mulu.htm'),
                  ('廿一期', 'http://www.lingshh.com/21-a/mulu.htm'),
                  ('廿二期', 'http://www.lingshh.com/22-a/mulu.htm'),
                  ('廿三期', 'http://www.lingshh.com/23/mulu.htm'),
                  ('廿四期', 'http://www.lingshh.com/24/mulu.htm'),
                  ('廿五期', 'http://www.lingshh.com/25/mulu.htm'),
                  ('廿六期', 'http://www.lingshh.com/26/mulu.htm'),
                  ('廿七期', 'http://www.lingshh.com/27/mulu.htm'),
                  ('廿八期', 'http://www.lingshh.com/28/mulu.htm'),
                  ('廿九期', 'http://www.lingshh.com/29/mulu.htm'),
                  ('三十期', 'http://www.lingshh.com/30/mulu.htm'),
                  ('卅一期', 'http://www.lingshh.com/31/mulu.htm'),
                  ('卅二期', 'http://www.lingshh.com/32/mulu.htm'),
                  ('卅三期', 'http://www.lingshh.com/33/mulu.htm'),
                  ('卅四期', 'http://www.lingshh.com/34/mulu.htm'),
                  ('卅五期', 'http://www.lingshh.com/2010lshh/mulu1.htm'),
                  ('卅六期', 'http://www.lingshh.com/36/mulu.htm'),
                  ('卅七期', 'http://www.lingshh.com/37/mulu.htm'),
                  ('卅九期', 'http://www.lingshh.com/39/mulu.htm'),
                  ('四十期', 'http://www.lingshh.com/40/mulu.htm'),
                  ('四一期', 'http://www.lingshh.com/41/mulu.htm'),
                  ('四二期', 'http://www.lingshh.com/42/mulu.htm'),
                  ('四三期', 'http://www.lingshh.com/43/mulu.htm'),
                  ('四四期', 'http://www.lingshh.com/44/mulu.htm'),
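                  A sketch of how such a list could be generated automatically, assuming (as the recipes above do) that the index links all end in .htm; the sample HTML and the helper extract_feeds are made-up stand-ins for the real index page:

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Stand-in for the downloaded index page; in practice you would fetch
# http://www.lingshh.com/ and feed its HTML to BeautifulSoup.
html = """
<a href="cairx/mulu.htm">蔡日新文集</a>
<a href="lingshansizhi/mulu.htm">灵山寺志</a>
<a href="about.doc">ignored</a>
"""

def extract_feeds(raw_html, base_url):
    """Return (link text, absolute URL) tuples for every .htm link."""
    soup = BeautifulSoup(raw_html, 'html.parser')
    feeds = []
    for link in soup.find_all('a', href=re.compile(r'\.htm$')):
        feeds.append((link.get_text().strip(), urljoin(base_url, link['href'])))
    return feeds

# Print in the same tuple format used by the feeds list above.
for name, url in extract_feeds(html, 'http://www.lingshh.com/'):
    print("(%r, %r)," % (name, url))
```

                  Printed in that format, the output can be pasted into the feeds list a batch at a time.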
                • Thanks!

                  But there may be too much content; "fetching news" runs on and on without ever finishing…

                  I'll run into many pages like this in the future that need the link names + corresponding URLs recognized automatically. How can that be done?

                  ('蔡日新文集', 'http://www.lingshh.com/cairx/mulu.htm'),

                • Perhaps you misunderstood me. The list I gave you is meant to make manual additions easier, not to be added all at once; otherwise it would do the same thing as the first script I gave you that fetches every link on the page, and with that much content you could still hit the memory error. My advice is still what I said before: set the links by hand and fetch in batches.

                  For different pages with different content structures, the script has to be adjusted accordingly; there is no one-size-fits-all script that can handle every website's content.

  2. Thanks for the tutorial!
    One addition: the ebook export feature was switched off by default two months ago, so you have to go to your GitBook homepage, click the triangle to the right of the book → book settings, and turn the ebook feature on. Only then will the epub, mobi, or PDF downloads be available!
    See the official announcement:
    https://www.gitbook.com/blog/features/ebooks-option

  3. In the PDF ebooks exported by gitbook, the Chinese characters come out in mixed sizes and look crooked. Do you know how to fix that?
