Package org.apache.pdfbox.tools
Class PDFText2HTML
java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.apache.pdfbox.tools.PDFText2HTML
public class PDFText2HTML
extends org.apache.pdfbox.text.PDFTextStripper
Wrap stripped text in simple HTML, trying to form HTML paragraphs. Paragraphs
broken by pages, columns, or figures are not mended.
- Author:
- John J Barton
-
Field Summary
Fields inherited from class org.apache.pdfbox.text.PDFTextStripper
charactersByArticle, document, LINE_SEPARATOR, output
-
Constructor Summary
Constructors
PDFText2HTML()
- Constructor.
-
Method Summary
Modifier and Type / Method / Description
protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
protected void endArticle()
- Write out the article separator.
void endDocument(org.apache.pdfbox.pdmodel.PDDocument document)
protected String getTitle()
- This method will attempt to guess the title of the document using either the document properties or the first lines of text.
protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4)
protected void startArticle(boolean isLTR)
- Write out the article separator (div tag) with proper text direction information.
protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument document)
protected void writeHeader()
- Deprecated.
protected void writeParagraphEnd()
- Writes the paragraph end "</p>" to the output.
protected void writeString(String chars)
- Write a string to the output stream and escape some HTML characters.
protected void writeString(String text, List<org.apache.pdfbox.text.TextPosition> textPositions)
- Write a string to the output stream, maintain font state, and escape some HTML characters.
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeParagraphStart, writeText, writeWordSeparator
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Constructor Details
-
PDFText2HTML
Constructor.
- Throws:
IOException
- If there is an error during initialization.
-
-
Method Details
-
writeHeader
Deprecated.
Write the header to the output document. Now also writes the tag defining the character encoding.
- Throws:
IOException
- If there is a problem writing out the header to the document.
-
startDocument
- Overrides:
startDocument
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
endDocument
- Overrides:
endDocument
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
getTitle
Attempts to guess the title of the document using either the document properties or the first lines of text.
- Returns:
- The guessed title.
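
The fallback described for getTitle can be pictured with a small, library-free sketch. It is only an illustration under stated assumptions: the class TitleGuessSketch, the method guessTitle, and both parameters are hypothetical names, and the real method may apply further heuristics that this page does not describe. The sketch prefers a non-blank title from the document properties and otherwise takes the first non-blank line of text.

```java
import java.util.List;

// Hypothetical illustration; not part of PDFBox.
final class TitleGuessSketch {

    /**
     * Returns the metadata title if it is non-blank, otherwise the first
     * non-blank line of text, otherwise an empty string.
     */
    static String guessTitle(String metadataTitle, List<String> firstLines) {
        // Assumption: a title stored in the document properties wins.
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        // Assumption: otherwise fall back to the first line that has content.
        for (String line : firstLines) {
            if (line != null && !line.trim().isEmpty()) {
                return line.trim();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(guessTitle("", List.of("  ", "First Heading", "Body text")));
    }
}
```

Returning an empty string rather than null keeps the result safe to place directly inside a title element.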
-
startArticle
Write out the article separator (div tag) with proper text direction information.
- Overrides:
startArticle
in class org.apache.pdfbox.text.PDFTextStripper
- Parameters:
isLTR
- true if direction of text is left to right
- Throws:
IOException
- If there is an error writing to the stream.
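
To make the "div tag with text direction" idea concrete, here is a minimal sketch. The exact markup this class emits is not stated on this page, so the dir attribute value, the closing tag, and the names ArticleTagSketch, openArticle and closeArticle are all assumptions made for illustration.

```java
// Hypothetical illustration; not part of PDFBox.
final class ArticleTagSketch {

    /** Opening tag for an article; only right-to-left text gets an explicit direction. */
    static String openArticle(boolean isLTR) {
        // Assumption: left-to-right is the HTML default, so no attribute is needed.
        return isLTR ? "<div>" : "<div dir=\"rtl\">";
    }

    /** Closing tag, assumed to be the counterpart written at the end of an article. */
    static String closeArticle() {
        return "</div>";
    }

    public static void main(String[] args) {
        System.out.println(openArticle(false) + "..." + closeArticle());
    }
}
```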
-
endArticle
Write out the article separator.
- Overrides:
endArticle
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
- If there is an error writing to the stream.
-
writeString
protected void writeString(String text, List<org.apache.pdfbox.text.TextPosition> textPositions) throws IOException
Write a string to the output stream, maintain font state, and escape some HTML characters. The font state is only preserved per word.
- Overrides:
writeString
in class org.apache.pdfbox.text.PDFTextStripper
- Parameters:
text
- The text to write to the stream.
textPositions
- The corresponding text positions.
- Throws:
IOException
- If there is an error writing to the stream.
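
One way to read "font state preserved per word" is that style tags are opened or closed only when the style changes between words, and that everything still open is closed when the paragraph ends. The sketch below models that reading. It is not the library's code: the class FontStateSketch, its methods, and the choice of b and i tags are assumptions for illustration.

```java
// Hypothetical illustration; not part of PDFBox.
final class FontStateSketch {
    private boolean bold;
    private boolean italic;
    private final StringBuilder out = new StringBuilder();

    /** Writes one word, emitting tags only when the style differs from the previous word. */
    void writeWord(String word, boolean wordBold, boolean wordItalic) {
        if (wordBold != bold || wordItalic != italic) {
            // Close everything and reopen, which keeps the tags properly nested.
            closeOpenTags();
            if (wordBold) {
                out.append("<b>");
            }
            if (wordItalic) {
                out.append("<i>");
            }
            bold = wordBold;
            italic = wordItalic;
        }
        out.append(word);
    }

    /** Models a paragraph end: closes open tags, clears the state, writes the end tag. */
    void endParagraph() {
        closeOpenTags();
        out.append("</p>");
    }

    /** Closes tags in reverse order of opening and resets the tracked state. */
    private void closeOpenTags() {
        if (italic) {
            out.append("</i>");
        }
        if (bold) {
            out.append("</b>");
        }
        bold = false;
        italic = false;
    }

    String html() {
        return out.toString();
    }

    public static void main(String[] args) {
        FontStateSketch sketch = new FontStateSketch();
        sketch.writeWord("plain ", false, false);
        sketch.writeWord("bold ", true, false);
        sketch.writeWord("both", true, true);
        sketch.endParagraph();
        System.out.println(sketch.html());
    }
}
```

Closing and reopening all tags on every change costs a few redundant tags but avoids overlapping markup such as a bold tag closing inside an italic one.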
-
writeString
Write a string to the output stream and escape some HTML characters.
- Overrides:
writeString
in class org.apache.pdfbox.text.PDFTextStripper
- Parameters:
chars
- String to be written to the stream
- Throws:
IOException
- If there is an error writing to the stream.
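
This page does not list which characters are escaped. As a rough sketch of what escaping "some HTML characters" usually involves, the following replaces the four characters that are significant in HTML markup. The class HtmlEscapeSketch and the method escape are hypothetical names, and the set of characters the library actually handles may be larger.

```java
// Hypothetical illustration; not part of PDFBox.
final class HtmlEscapeSketch {

    /** Replaces <, >, & and the double quote with their HTML entities. */
    static String escape(String chars) {
        StringBuilder out = new StringBuilder(chars.length());
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            switch (c) {
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '&':
                    out.append("&amp;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                default:
                    // Assumption: every other character is passed through unchanged.
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: a &lt; b &amp; c
        System.out.println(escape("a < b & c"));
    }
}
```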
-
writeParagraphEnd
Writes the paragraph end "</p>" to the output and clears the font state.
- Overrides:
writeParagraphEnd
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) throws IOException
- Overrides:
showGlyph
in class org.apache.pdfbox.contentstream.PDFStreamEngine
- Throws:
IOException
-
computeFontHeight
- Throws:
IOException
-