Package org.apache.poi.extractor
Class ExtractorFactory
- java.lang.Object
-
- org.apache.poi.extractor.ExtractorFactory
-
public final class ExtractorFactory extends Object
Figures out the correct POIOLE2TextExtractor for your supplied document, and returns it.
Note 1 - will fail for many file formats if the POI Scratchpad jar is not present on the runtime classpath.
Note 2 - for text extractor creation across all formats, use POIXMLExtractorFactory, contained within the OOXML jar.
Note 3 - rather than using this, for most cases you would be better off switching to Apache Tika instead!
-
-
Field Summary
static String OOXML_PACKAGE
Some OPCPackages are packed inside an OLE2 container.
-
Method Summary
static void addProvider(ExtractorProvider provider)
static POITextExtractor createExtractor(File file)
Create an extractor that can be used to read text from the given file.
static POITextExtractor createExtractor(File file, String password)
Create an extractor that can be used to read text from the given file.
static POITextExtractor createExtractor(InputStream input)
Create an extractor that can be used to read text from the given file.
static POITextExtractor createExtractor(InputStream input, String password)
Create an extractor that can be used to read text from the given file.
static POITextExtractor createExtractor(DirectoryNode root)
Create the Extractor, if possible.
static POITextExtractor createExtractor(DirectoryNode root, String password)
Create the Extractor, if possible.
static POITextExtractor createExtractor(POIFSFileSystem fs)
Create an extractor that can be used to read text from the given file.
static POITextExtractor createExtractor(POIFSFileSystem fs, String password)
Create an extractor that can be used to read text from the given file.
static Boolean getAllThreadsPreferEventExtractors()
Should all threads prefer event based over usermodel based extractors? (usermodel extractors tend to be more accurate, but use more memory) Default is to use the thread level setting, which defaults to false.
static POITextExtractor[] getEmbeddedDocsTextExtractors(POIOLE2TextExtractor ext)
Returns an array of text extractors, one for each of the embedded documents in the file (if there are any).
static boolean getPreferEventExtractor()
Should this thread use event based extractors if available? Checks the all-threads setting first, then the thread specific one.
static boolean getThreadPrefersEventExtractors()
Should this thread prefer event based over usermodel based extractors? (usermodel extractors tend to be more accurate, but use more memory) Default is false.
static void removeProvider(Class<? extends ExtractorProvider> provider)
static void removeThreadPrefersEventExtractorsSetting()
Clears the setting for this thread made by setThreadPrefersEventExtractors(boolean)
static void setAllThreadsPreferEventExtractors(Boolean preferEventExtractors)
Should all threads prefer event based over usermodel based extractors? If set, will take preference over the Thread level setting.
static void setThreadPrefersEventExtractors(boolean preferEventExtractors)
Should this thread prefer event based over usermodel based extractors? Will only be used if the All Threads setting is null.
-
-
-
Field Detail
-
OOXML_PACKAGE
public static final String OOXML_PACKAGE
Some OPCPackages are packed inside an OLE2 container. If encrypted, the DirectoryNode is called "EncryptedPackage", otherwise the node is called "Package".
- See Also:
- Constant Field Values
-
-
Method Detail
-
getThreadPrefersEventExtractors
public static boolean getThreadPrefersEventExtractors()
Should this thread prefer event based over usermodel based extractors? (usermodel extractors tend to be more accurate, but use more memory) Default is false.
- Returns:
- true if event extractors should be preferred in the current thread, false otherwise.
-
getAllThreadsPreferEventExtractors
public static Boolean getAllThreadsPreferEventExtractors()
Should all threads prefer event based over usermodel based extractors? (usermodel extractors tend to be more accurate, but use more memory) Default is to use the thread level setting, which defaults to false.
- Returns:
- true if event extractors should be preferred in all threads, false otherwise.
-
setThreadPrefersEventExtractors
public static void setThreadPrefersEventExtractors(boolean preferEventExtractors)
Should this thread prefer event based over usermodel based extractors? Will only be used if the All Threads setting is null.
This uses ThreadLocals, which can leak resources when you have a lot of threads. You should always try to call removeThreadPrefersEventExtractorsSetting() afterwards.
- Parameters:
preferEventExtractors
- Whether this thread should prefer event based extractors.
-
removeThreadPrefersEventExtractorsSetting
public static void removeThreadPrefersEventExtractorsSetting()
Clears the setting for this thread made by setThreadPrefersEventExtractors(boolean)
- Since:
- POI 5.2.4
- See Also:
setThreadPrefersEventExtractors(boolean)
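The set-then-remove advice matters most on pooled threads, where a thread outlives the task that configured it. The following is a minimal, stand-alone sketch of that pattern using only the JDK. The `ThreadSettingCleanup` class and its `PREFER_EVENT` field are hypothetical stand-ins, not POI code; in real use the `set` and `remove` calls would presumably be `ExtractorFactory.setThreadPrefersEventExtractors(boolean)` and `ExtractorFactory.removeThreadPrefersEventExtractorsSetting()`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadSettingCleanup {
    // Hypothetical stand-in for the per-thread setting; defaults to false.
    static final ThreadLocal<Boolean> PREFER_EVENT =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Sets the preference, does the work, and always clears the setting.
    static boolean runWithPreference(boolean prefer) {
        PREFER_EVENT.set(prefer);
        try {
            return PREFER_EVENT.get(); // extraction work would happen here
        } finally {
            PREFER_EVENT.remove(); // runs even if the work throws
        }
    }

    // Runs two tasks on the same pooled thread and reports what each one saw.
    static boolean[] demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> first = () -> runWithPreference(true);
            Callable<Boolean> second = () -> PREFER_EVENT.get();
            boolean a = pool.submit(first).get();
            boolean b = pool.submit(second).get();
            return new boolean[] { a, b };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        boolean[] seen = demo();
        System.out.println(seen[0] + " " + seen[1]); // prints "true false"
    }
}
```

Because the value is removed in `finally`, the second task on the reused thread sees the default again rather than the first task's leftover setting.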
-
setAllThreadsPreferEventExtractors
public static void setAllThreadsPreferEventExtractors(Boolean preferEventExtractors)
Should all threads prefer event based over usermodel based extractors? If set, will take preference over the Thread level setting.
- Parameters:
preferEventExtractors
- Whether all threads should prefer event based extractors.
-
getPreferEventExtractor
public static boolean getPreferEventExtractor()
Should this thread use event based extractors if available? Checks the all-threads setting first, then the thread specific one.
- Returns:
- Whether the current thread should use event based extractors.
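The lookup order described for the three settings (a nullable all-threads value that, when set, overrides a per-thread value defaulting to false) can be modelled in a few lines. This is a sketch of the documented behaviour under those assumptions, not POI's actual source; the class name `EventPreferenceModel` and its methods are hypothetical.

```java
public class EventPreferenceModel {
    // null means "no all-threads override", mirroring the nullable Boolean setting.
    private static volatile Boolean allThreads = null;

    // Per-thread value; defaults to false when a thread has not set anything.
    private static final ThreadLocal<Boolean> perThread =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    static void setAllThreads(Boolean value) { allThreads = value; }

    static void setThread(boolean value) { perThread.set(value); }

    static void removeThread() { perThread.remove(); }

    // The all-threads setting wins if present, otherwise the thread setting applies.
    static boolean effective() {
        Boolean all = allThreads;
        return all != null ? all : perThread.get();
    }

    public static void main(String[] args) {
        setThread(true);
        System.out.println(effective()); // true: only the thread setting is in place

        setAllThreads(Boolean.FALSE);
        System.out.println(effective()); // false: the all-threads setting overrides it

        setAllThreads(null);
        removeThread();
        System.out.println(effective()); // false: back to the default
    }
}
```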
-
createExtractor
public static POITextExtractor createExtractor(POIFSFileSystem fs) throws IOException
Create an extractor that can be used to read text from the given file.
- Parameters:
fs
- The file-system which wraps the data of the file.
- Returns:
- A POITextExtractor that can be used to fetch text-content of the file.
- Throws:
IOException
- If reading the file-data fails
-
createExtractor
public static POITextExtractor createExtractor(POIFSFileSystem fs, String password) throws IOException
Create an extractor that can be used to read text from the given file.
- Parameters:
fs
- The file-system which wraps the data of the file.
password
- The password that is necessary to open the file
- Returns:
- A POITextExtractor that can be used to fetch text-content of the file.
- Throws:
IOException
- If reading the file-data fails
-
createExtractor
public static POITextExtractor createExtractor(InputStream input) throws IOException
Create an extractor that can be used to read text from the given file.
- Parameters:
input
- A stream which wraps the data of the file.
- Returns:
- A POITextExtractor that can be used to fetch text-content of the file.
- Throws:
IOException
- If reading the file-data fails
EmptyFileException
- If the given file is empty
-
createExtractor
public static POITextExtractor createExtractor(InputStream input, String password) throws IOException
Create an extractor that can be used to read text from the given file.
- Parameters:
input
- A stream which wraps the data of the file.
password
- The password that is necessary to open the file
- Returns:
- A POITextExtractor that can be used to fetch text-content of the file.
- Throws:
IOException
- If reading the file-data fails
EmptyFileException
- If the given file is empty
-
createExtractor
public static POITextExtractor createExtractor(File file) throws IOException
Create an extractor that can be used to read text from the given file.
- Parameters:
file
- The file to read
- Returns:
- A POITextExtractor that can be used to fetch text-content of the file.
- Throws:
IOException
- If reading the file-data fails
EmptyFileException
- If the given file is empty
-
createExtractor
public static POITextExtractor createExtractor(File file, String password) throws IOException
Create an extractor that can be used to read text from the given file.
- Parameters:
file
- The file to read
password
- The password that is necessary to open the file
- Returns:
- A POITextExtractor that can be used to fetch text-content of the file.
- Throws:
IOException
- If reading the file-data fails
EmptyFileException
- If the given file is empty
-
createExtractor
public static POITextExtractor createExtractor(DirectoryNode root) throws IOException
Create the Extractor, if possible. Generally needs the Scratchpad jar. Note that this won't check for embedded OOXML resources either, use POIXMLExtractorFactory for that.
- Parameters:
root
- The DirectoryNode pointing to a document.
- Returns:
- The resulting POITextExtractor; an exception is thrown if no TextExtractor can be created for some reason.
- Throws:
IOException
- If converting the DirectoryNode into a HSSFWorkbook fails
OldFileFormatException
- If the DirectoryNode points to a format of an unsupported version of Excel.
IllegalArgumentException
- If creating the Extractor fails
-
createExtractor
public static POITextExtractor createExtractor(DirectoryNode root, String password) throws IOException
Create the Extractor, if possible. Generally needs the Scratchpad jar. Note that this won't check for embedded OOXML resources either, use POIXMLExtractorFactory for that.
- Parameters:
root
- The DirectoryNode pointing to a document.
password
- The password that is necessary to open the file
- Returns:
- The resulting POITextExtractor; an exception is thrown if no TextExtractor can be created for some reason.
- Throws:
IOException
- If converting the DirectoryNode into a HSSFWorkbook fails
OldFileFormatException
- If the DirectoryNode points to a format of an unsupported version of Excel.
IllegalArgumentException
- If creating the Extractor fails
-
getEmbeddedDocsTextExtractors
public static POITextExtractor[] getEmbeddedDocsTextExtractors(POIOLE2TextExtractor ext) throws IOException
Returns an array of text extractors, one for each of the embedded documents in the file (if there are any). If there are no embedded documents, you'll get back an empty array. Otherwise, you'll get one open POITextExtractor for each embedded file.
- Parameters:
ext
- The extractor to look at for embedded documents
- Returns:
- An array of resulting extractors. Empty if no embedded documents are found.
- Throws:
IOException
- If converting the DirectoryNode into a HSSFWorkbook fails
OldFileFormatException
- If the DirectoryNode points to a format of an unsupported version of Excel.
IllegalArgumentException
- If creating the Extractor fails
-
addProvider
public static void addProvider(ExtractorProvider provider)
-
removeProvider
public static void removeProvider(Class<? extends ExtractorProvider> provider)
-
-