Class WebcrawlerConnector
- java.lang.Object
-
- org.apache.manifoldcf.core.connector.BaseConnector
-
- org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
-
- org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector
-
- All Implemented Interfaces:
org.apache.manifoldcf.core.interfaces.IConnector, org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
public class WebcrawlerConnector extends org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
This is the Web Crawler implementation of the IRepositoryConnector interface. This connector may be superseded by one that calls out to Python, or by an entirely Python-based Connector Framework, depending on how the winds blow.
-
-
Nested Class Summary
Nested Classes
Modifier and Type, Class: Description
protected static class WebcrawlerConnector.CanonicalizationPolicies: Class representing a list of canonicalization rules
protected static class WebcrawlerConnector.CanonicalizationPolicy: Class representing a URL regular expression match, for the purposes of determining canonicalization policy
protected class WebcrawlerConnector.DocumentURLFilter: This class describes the url filtering information (for crawling and indexing) obtained from a digested DocumentSpecification.
protected static class WebcrawlerConnector.EvaluatorToken: Evaluator token.
protected static class WebcrawlerConnector.EvaluatorTokenStream: Token stream.
protected class WebcrawlerConnector.FeedContextClass
protected class WebcrawlerConnector.FeedItemContextClass
protected static class WebcrawlerConnector.FetchStatus
protected static class WebcrawlerConnector.MappingRule: Class representing a mapping rule
protected static class WebcrawlerConnector.MappingRules: Class that represents all mappings
protected static class WebcrawlerConnector.NameValue: Name/value class
protected class WebcrawlerConnector.OuterContextClass: This class handles the outermost XML context for the feed document.
protected class WebcrawlerConnector.ProcessActivityHTMLHandler: Class that describes HTML handling
protected class WebcrawlerConnector.ProcessActivityLinkHandler: This class is the handler for links that get added into an IProcessActivity object.
protected class WebcrawlerConnector.ProcessActivityRedirectionHandler: Class that describes redirection handling
protected class WebcrawlerConnector.ProcessActivityXMLHandler: Class that describes XML handling
protected class WebcrawlerConnector.RDFContextClass
protected class WebcrawlerConnector.RDFItemContextClass
protected class WebcrawlerConnector.RSSChannelContextClass
protected class WebcrawlerConnector.RSSContextClass
protected class WebcrawlerConnector.RSSItemContextClass
protected class WebcrawlerConnector.UrlsetContextClass
protected class WebcrawlerConnector.UrlsetItemContextClass
-
Field Summary
Fields
Modifier and Type, Field: Description
static java.lang.String _rcsid
static java.lang.String ACTIVITY_FETCH
static java.lang.String ACTIVITY_LOGON_END
static java.lang.String ACTIVITY_LOGON_START
static java.lang.String ACTIVITY_PROCESS
static java.lang.String ACTIVITY_ROBOTSPARSE
protected static DataCache cache: This is where we keep data around between the getVersions() phase and the processDocuments() phase.
protected int connectionTimeoutMilliseconds: Connection timeout, milliseconds.
protected CookieManager cookieManager: The cookie manager used by this instance
protected CredentialsDescription credentialsDescription: The credentials description
protected DNSManager dnsManager: The DNS manager currently used by this instance
protected static java.lang.String FETCH_LOGIN
protected static java.lang.String FETCH_ROBOTS
protected static java.lang.String FETCH_STANDARD
protected java.lang.String from: The email address for this connector instance
protected static java.lang.String[] interestingMimeTypeArray: This represents a list of the mime types that this connector knows how to extract links from.
protected static java.util.Set<java.lang.String> interestingMimeTypeMap
protected boolean isInitialized: This flag is set when the instance has been initialized
protected static int META_ROBOTS_ALL
protected static int META_ROBOTS_NONE
protected int metaRobotsTagsUsage: Meta robots tag usage flag
protected static java.util.List<java.lang.String> potentiallyExcludedHeaders
protected java.lang.String proxyAuthDomain: Proxy auth domain
protected java.lang.String proxyAuthPassword: Proxy auth password
protected java.lang.String proxyAuthUsername: Proxy auth user name
protected java.lang.String proxyHost: Proxy host
protected int proxyPort: Proxy port
static java.lang.String REL_LINK
static java.lang.String REL_REDIRECT
protected static java.util.Set<java.lang.String> reservedHeaders
protected static int RESULT_NO_DOCUMENT
protected static int RESULT_NO_VERSION
protected static int RESULT_RETRY_DOCUMENT
protected static int RESULT_VERSION_NEEDED
protected static int RESULTSTATUS_FALSE
protected static int RESULTSTATUS_NOTYETDETERMINED
protected static int RESULTSTATUS_TRUE
protected static int ROBOTS_ALL
protected static int ROBOTS_DATA
protected static int ROBOTS_NONE
protected RobotsManager robotsManager: The robots manager currently used by this instance
protected int robotsUsage: Robots usage flag
protected static int SESSIONSTATE_LOGIN: We're in 'login mode'
protected static int SESSIONSTATE_NORMAL: Normal fetch of content document.
protected int socketTimeoutMilliseconds: Socket timeout, milliseconds
protected ThrottleDescription throttleDescription: The throttle description
protected java.lang.String throttleGroupName: Throttle group name
protected TrustsDescription trustsDescription: The trusts description
protected static java.util.Set<java.lang.String> understoodProtocols
protected java.lang.String userAgent: The user-agent for this connector instance
-
Fields inherited from class org.apache.manifoldcf.core.connector.BaseConnector
currentContext, params
-
Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
GLOBAL_DENY_TOKEN, JOBMODE_CONTINUOUS, JOBMODE_ONCEONLY, MODEL_ADD, MODEL_ADD_CHANGE, MODEL_ADD_CHANGE_DELETE, MODEL_ALL, MODEL_CHAINED_ADD, MODEL_CHAINED_ADD_CHANGE, MODEL_CHAINED_ADD_CHANGE_DELETE, MODEL_PARTIAL
-
-
Constructor Summary
Constructors
Constructor: Description
WebcrawlerConnector(): Constructor.
-
Method Summary
Methods
Modifier and Type, Method: Description
java.lang.String addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity activities, org.apache.manifoldcf.core.interfaces.Specification spec, java.lang.String lastSeedVersion, long seedTime, int jobMode): Queue "seed" documents.
protected java.lang.String[] calculateDocumentEvents(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities, java.lang.String documentIdentifier): Calculate events that should be associated with a document.
java.lang.String check(): Check status of connection.
protected int checkFetchAllowed(java.lang.String documentIdentifier, java.lang.String protocol, java.lang.String hostIPAddress, int port, PageCredentials credential, org.apache.manifoldcf.connectorcommon.interfaces.IKeystoreManager trustStore, java.lang.String hostName, java.lang.String[] binNames, long currentTime, java.lang.String pathString, org.apache.manifoldcf.crawler.interfaces.IProcessActivity versionActivities, int connectionLimit, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword): Check robots to see if fetch is allowed.
void clearThreadContext(): Clear out any state information specific to a given thread.
protected static void compileList(java.util.List<java.util.regex.Pattern> output, java.util.List<java.lang.String> input): Compile all regexp entries in the passed in list, and add them to the output list.
void deinstall(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext): Uninstall the connector.
void disconnect(): Close the connection.
protected java.lang.String doCanonicalization(WebcrawlerConnector.DocumentURLFilter filter, WebURL url): Code to canonicalize a URL.
protected java.lang.String documentIdentifiertoFileName(java.lang.String documentIdentifier): Convert a document identifier to filename.
protected static java.lang.String extractContentType(java.lang.String contentType)
protected static java.lang.String extractEncoding(java.lang.String contentType)
protected boolean extractLinks(java.lang.String documentIdentifier, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, WebcrawlerConnector.DocumentURLFilter filter): Code to extract links from an already-fetched document.
protected static java.lang.String extractMimeType(java.lang.String contentType)
protected static java.util.Set<java.lang.String> findExcludedHeaders(org.apache.manifoldcf.core.interfaces.Specification spec): Read a document specification to get a set of excluded headers
protected FormData findHTMLForm(java.lang.String currentURI, LoginParameters lp): Find matching HTML form data, if present.
protected java.lang.String findHTMLLinkURI(java.lang.String currentURI, LoginParameters lp): Find HTML link URI, if present, making sure specified preference is matched.
protected java.lang.String findPreferredRedirectionURI(java.lang.String currentURI, LoginParameters lp): Find a preferred redirection URI, if it exists
protected java.lang.String findRedirectionURI(java.lang.String currentURI): Find a redirection URI, if it exists
protected java.lang.String findSpecifiedContent(java.lang.String currentURI, LoginParameters lp): Find existence of specific content on the page (never finds a URL)
protected static java.lang.String[] getAcls(org.apache.manifoldcf.core.interfaces.Specification spec): Grab forced acl out of document specification.
java.lang.String[] getActivitiesList(): Return the list of activities that this connector supports (i.e. writes into the log).
java.lang.String[] getBinNames(java.lang.String documentIdentifier): Get the bin name string for a document identifier.
int getConnectorModel(): Tell the world what model this connector uses for getDocumentIdentifiers().
static java.lang.String getFinalURL(java.lang.String url): If the initial url is permanently or temporarily redirected (code 301 or 302), the method returns the destination url
int getMaxDocumentRequest(): Get the maximum number of documents to amalgamate together into one batch, for this connector.
protected PageCredentials getPageCredential(java.lang.String documentIdentifier): Get the page credentials for a given document identifier (URL)
java.lang.String[] getRelationshipTypes(): Return the list of relationship types that this connector recognizes.
protected SequenceCredentials getSequenceCredential(java.lang.String documentIdentifier): Get the sequence credentials for a given document identifier (URL)
protected void getSession(): Start a session
protected org.apache.manifoldcf.connectorcommon.interfaces.IKeystoreManager getTrustStore(java.lang.String documentIdentifier): Get the trust store for a given document identifier (URL)
protected void handleHTML(java.lang.String documentURI, IHTMLHandler handler): Handle document references from HTML
protected static void handleIOException(java.io.IOException e, java.lang.String context)
protected void handleRedirects(java.lang.String documentURI, IRedirectionHandler handler): Handle extracting the redirect link from a redirect response.
protected void handleXML(java.lang.String documentURI, IXMLHandler handler): Handle document references from XML.
void install(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext): Install the connector.
protected boolean isContentInteresting(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities, java.lang.String documentIdentifier, int response, java.lang.String contentType): Code to check if data is interesting, based on response code and content type.
protected boolean isDocumentText(java.lang.String documentURI): Is the document text, as far as we can tell?
protected static boolean isStrange(byte x): Check if character is not typical ASCII or utf-8.
protected static boolean isText(byte[] beginChunk, int chunkLength): Test to see if a document is text or not.
protected static boolean isWhiteSpace(byte x): Check if a byte is a whitespace character.
protected void loginAndFetch(WebcrawlerConnector.FetchStatus fetchStatus, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, java.lang.String documentIdentifier, SequenceCredentials sessionCredential, java.lang.String globalSequenceEvent)
protected int lookupIPAddress(java.lang.String documentIdentifier, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, java.lang.String hostName, long currentTime, java.lang.StringBuilder ipAddressBuffer): Look up an ipaddress given a non-canonical host name.
protected java.lang.String makeDNSEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities, java.lang.String hostNameKey): Calculate the event name for DNS access.
protected java.lang.String makeDocumentIdentifier(java.lang.String parentIdentifier, java.lang.String rawURL, WebcrawlerConnector.DocumentURLFilter filter, org.apache.manifoldcf.crawler.interfaces.IHistoryActivity activities): Convert an absolute or relative URL to a document identifier.
protected java.lang.String makeRobotsEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity versionActivities, java.lang.String robotsKey): Construct a name for the global web-connector robots event.
protected static java.lang.String makeRobotsKey(java.lang.String protocol, java.lang.String hostName, int port): Construct the robots key for a host.
protected java.lang.String makeSessionLoginEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities, java.lang.String sequenceKey): Calculate the event name for session login.
void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.lang.String tabName): Output the configuration body section.
void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.util.List<java.lang.String> tabsArray): Output the configuration header section.
void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber, int actualSequenceNumber, java.lang.String tabName): Output the specification body section.
void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber, java.util.List<java.lang.String> tabsArray): Output the specification header section.
void poll(): This method is periodically called for all connectors that are connected but not in active use.
java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters): Process a configuration post.
protected void processDocument(org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, java.lang.String documentIdentifier, java.lang.String versionString, boolean indexDocument, java.util.Map<java.lang.String,java.util.Set<java.lang.String>> metaHash, java.lang.String[] acls, WebcrawlerConnector.DocumentURLFilter filter)
void processDocuments(java.lang.String[] documentIdentifiers, org.apache.manifoldcf.crawler.interfaces.IExistingVersions statuses, org.apache.manifoldcf.core.interfaces.Specification spec, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int jobMode, boolean usesDefaultAuthority): Process a set of documents.
java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber): Process a specification post.
protected static java.util.List<java.lang.String> stringToArray(java.lang.String input): Read a string as a sequence of individual expressions, urls, etc.
void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters): View configuration.
void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber): View specification.
-
Methods inherited from class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
getFormCheckJavascriptMethodName, getFormPresaveCheckJavascriptMethodName, requestInfo
-
Methods inherited from class org.apache.manifoldcf.core.connector.BaseConnector
connect, getConfiguration, isConnected, outputConfigurationBody, outputConfigurationHeader, outputConfigurationHeader, pack, packFixedList, packList, packList, processConfigurationPost, setThreadContext, unpack, unpackFixedList, unpackList, viewConfiguration
-
-
-
-
Field Detail
-
_rcsid
public static final java.lang.String _rcsid
- See Also:
- Constant Field Values
-
RESULTSTATUS_FALSE
protected static final int RESULTSTATUS_FALSE
- See Also:
- Constant Field Values
-
RESULTSTATUS_TRUE
protected static final int RESULTSTATUS_TRUE
- See Also:
- Constant Field Values
-
RESULTSTATUS_NOTYETDETERMINED
protected static final int RESULTSTATUS_NOTYETDETERMINED
- See Also:
- Constant Field Values
-
interestingMimeTypeArray
protected static final java.lang.String[] interestingMimeTypeArray
This represents a list of the mime types that this connector knows how to extract links from. Documents that are indexable are described by the output connector.
-
interestingMimeTypeMap
protected static final java.util.Set<java.lang.String> interestingMimeTypeMap
-
understoodProtocols
protected static final java.util.Set<java.lang.String> understoodProtocols
-
ROBOTS_NONE
protected static final int ROBOTS_NONE
- See Also:
- Constant Field Values
-
ROBOTS_DATA
protected static final int ROBOTS_DATA
- See Also:
- Constant Field Values
-
ROBOTS_ALL
protected static final int ROBOTS_ALL
- See Also:
- Constant Field Values
-
META_ROBOTS_NONE
protected static final int META_ROBOTS_NONE
- See Also:
- Constant Field Values
-
META_ROBOTS_ALL
protected static final int META_ROBOTS_ALL
- See Also:
- Constant Field Values
-
REL_LINK
public static final java.lang.String REL_LINK
- See Also:
- Constant Field Values
-
REL_REDIRECT
public static final java.lang.String REL_REDIRECT
- See Also:
- Constant Field Values
-
ACTIVITY_FETCH
public static final java.lang.String ACTIVITY_FETCH
- See Also:
- Constant Field Values
-
ACTIVITY_PROCESS
public static final java.lang.String ACTIVITY_PROCESS
- See Also:
- Constant Field Values
-
ACTIVITY_ROBOTSPARSE
public static final java.lang.String ACTIVITY_ROBOTSPARSE
- See Also:
- Constant Field Values
-
ACTIVITY_LOGON_START
public static final java.lang.String ACTIVITY_LOGON_START
- See Also:
- Constant Field Values
-
ACTIVITY_LOGON_END
public static final java.lang.String ACTIVITY_LOGON_END
- See Also:
- Constant Field Values
-
FETCH_ROBOTS
protected static final java.lang.String FETCH_ROBOTS
- See Also:
- Constant Field Values
-
FETCH_STANDARD
protected static final java.lang.String FETCH_STANDARD
- See Also:
- Constant Field Values
-
FETCH_LOGIN
protected static final java.lang.String FETCH_LOGIN
- See Also:
- Constant Field Values
-
reservedHeaders
protected static final java.util.Set<java.lang.String> reservedHeaders
-
potentiallyExcludedHeaders
protected static final java.util.List<java.lang.String> potentiallyExcludedHeaders
-
robotsUsage
protected int robotsUsage
Robots usage flag
-
metaRobotsTagsUsage
protected int metaRobotsTagsUsage
Meta robots tag usage flag
-
userAgent
protected java.lang.String userAgent
The user-agent for this connector instance
-
from
protected java.lang.String from
The email address for this connector instance
-
connectionTimeoutMilliseconds
protected int connectionTimeoutMilliseconds
Connection timeout, milliseconds.
-
socketTimeoutMilliseconds
protected int socketTimeoutMilliseconds
Socket timeout, milliseconds
-
throttleGroupName
protected java.lang.String throttleGroupName
Throttle group name
-
throttleDescription
protected ThrottleDescription throttleDescription
The throttle description
-
credentialsDescription
protected CredentialsDescription credentialsDescription
The credentials description
-
trustsDescription
protected TrustsDescription trustsDescription
The trusts description
-
robotsManager
protected RobotsManager robotsManager
The robots manager currently used by this instance
-
dnsManager
protected DNSManager dnsManager
The DNS manager currently used by this instance
-
cookieManager
protected CookieManager cookieManager
The cookie manager used by this instance
-
isInitialized
protected boolean isInitialized
This flag is set when the instance has been initialized
-
cache
protected static DataCache cache
This is where we keep data around between the getVersions() phase and the processDocuments() phase.
-
proxyHost
protected java.lang.String proxyHost
Proxy host
-
proxyPort
protected int proxyPort
Proxy port
-
proxyAuthDomain
protected java.lang.String proxyAuthDomain
Proxy auth domain
-
proxyAuthUsername
protected java.lang.String proxyAuthUsername
Proxy auth user name
-
proxyAuthPassword
protected java.lang.String proxyAuthPassword
Proxy auth password
-
SESSIONSTATE_NORMAL
protected static final int SESSIONSTATE_NORMAL
Normal fetch of content document. (For all we know, we're logged in already).
- See Also:
- Constant Field Values
-
SESSIONSTATE_LOGIN
protected static final int SESSIONSTATE_LOGIN
We're in 'login mode'
- See Also:
- Constant Field Values
-
RESULT_NO_DOCUMENT
protected static final int RESULT_NO_DOCUMENT
- See Also:
- Constant Field Values
-
RESULT_NO_VERSION
protected static final int RESULT_NO_VERSION
- See Also:
- Constant Field Values
-
RESULT_VERSION_NEEDED
protected static final int RESULT_VERSION_NEEDED
- See Also:
- Constant Field Values
-
RESULT_RETRY_DOCUMENT
protected static final int RESULT_RETRY_DOCUMENT
- See Also:
- Constant Field Values
-
-
Method Detail
-
getConnectorModel
public int getConnectorModel()
Tell the world what model this connector uses for getDocumentIdentifiers(). This must return a model value as specified above.
- Specified by:
getConnectorModel in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
getConnectorModel in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Returns:
- the model type value.
-
install
public void install(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Install the connector. This method is called to initialize persistent storage for the connector, such as database tables etc. It is called when the connector is registered.
- Specified by:
install in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
install in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the current thread context.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
deinstall
public void deinstall(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Uninstall the connector. This method is called to remove persistent storage for the connector, such as database tables etc. It is called when the connector is deregistered.
- Specified by:
deinstall in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
deinstall in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the current thread context.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
getActivitiesList
public java.lang.String[] getActivitiesList()
Return the list of activities that this connector supports (i.e. writes into the log).
- Specified by:
getActivitiesList in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
getActivitiesList in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Returns:
- the list.
-
getRelationshipTypes
public java.lang.String[] getRelationshipTypes()
Return the list of relationship types that this connector recognizes.
- Specified by:
getRelationshipTypes in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
getRelationshipTypes in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Returns:
- the list.
-
clearThreadContext
public void clearThreadContext()
Clear out any state information specific to a given thread. This method is called when this object is returned to the connection pool.
- Specified by:
clearThreadContext in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
clearThreadContext in class org.apache.manifoldcf.core.connector.BaseConnector
-
getSession
protected void getSession() throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Start a session
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
poll
public void poll() throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
This method is periodically called for all connectors that are connected but not in active use.
- Specified by:
poll in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
poll in class org.apache.manifoldcf.core.connector.BaseConnector
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
check
public java.lang.String check() throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Check status of connection.
- Specified by:
check in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
check in class org.apache.manifoldcf.core.connector.BaseConnector
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
disconnect
public void disconnect() throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Close the connection. Call this before discarding the repository connector.
- Specified by:
disconnect in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
disconnect in class org.apache.manifoldcf.core.connector.BaseConnector
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
getBinNames
public java.lang.String[] getBinNames(java.lang.String documentIdentifier)
Get the bin name string for a document identifier. The bin name describes the queue to which the document will be assigned for throttling purposes. Throttling controls the rate at which items in a given queue are fetched; it does not say anything about the overall fetch rate, which may operate on multiple queues or bins. For example, if you implement a web crawler, a good choice of bin name would be the server name, since that is likely to correspond to a real resource that will need real throttle protection.
- Specified by:
getBinNames in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
getBinNames in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
documentIdentifier - is the document identifier.
- Returns:
- the bin name.
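The "server name as bin name" suggestion above can be sketched as follows. This is only an illustration under the assumption that the document identifier is an absolute URL; the class BinNameSketch and its method binNamesFor are hypothetical names, not part of ManifoldCF, and the connector's actual getBinNames() may compute its bins differently.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of ManifoldCF: derives a throttling bin
// name from a document identifier, assuming the identifier is a URL.
final class BinNameSketch {
    static String[] binNamesFor(String documentIdentifier) {
        try {
            String host = new URI(documentIdentifier).getHost();
            if (host == null) {
                // No host component (relative or opaque URI): use a catch-all bin.
                return new String[] {""};
            }
            // Lower-case so that "Example.COM" and "example.com" share one bin.
            return new String[] {host.toLowerCase(Locale.ROOT)};
        } catch (URISyntaxException e) {
            // Unparseable identifier: fall back to the catch-all bin.
            return new String[] {""};
        }
    }

    public static void main(String[] args) {
        System.out.println(binNamesFor("http://www.example.com/docs/index.html")[0]);
    }
}
```

Keying the bin on the host means all documents from one server are throttled together, which matches the rationale given in the description.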
-
addSeedDocuments
public java.lang.String addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity activities, org.apache.manifoldcf.core.interfaces.Specification spec, java.lang.String lastSeedVersion, long seedTime, int jobMode) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Queue "seed" documents. Seed documents are the starting places for crawling activity. Documents are seeded when this method calls appropriate methods in the passed in ISeedingActivity object.
This method can choose to find repository changes that happen only during the specified time interval. The seeds recorded by this method will be viewed by the framework based on what the getConnectorModel() method returns.
It is not a big problem if the connector chooses to create more seeds than are strictly necessary; it is merely a question of overall work required.
The end time and seeding version string passed to this method may be interpreted for greatest efficiency. For continuous crawling jobs, this method will be called once, when the job starts, and at various periodic intervals as the job executes.
When a job's specification is changed, the framework automatically resets the seeding version string to null. The seeding version string may also be set to null on each job run, depending on the connector model returned by getConnectorModel().
Note that it is always ok to send MORE documents rather than less to this method. The connector will be connected before this method can be called.
- Specified by:
addSeedDocuments in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
addSeedDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is a document specification (that comes from the job).
seedTime - is the end of the time range of documents to consider, exclusive.
lastSeedVersion - is the last seeding version string for this job, or null if the job has no previous seeding version string.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
- Returns:
- an updated seeding version string, to be stored with the job.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
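The seeding contract described for addSeedDocuments can be illustrated with a small self-contained sketch. Everything here is an assumption made for illustration: SeedSink is a hypothetical stand-in for the framework's seeding activity object, the seed list is assumed to be newline-separated text with '#' comment lines, and returning the seed time as the new version string is just one possible choice. It is not the connector's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a seeding pass; none of these names come from ManifoldCF.
final class SeedingSketch {
    // Stand-in for the framework object that receives seed documents.
    interface SeedSink {
        void addSeedDocument(String documentIdentifier);
    }

    // Queues every non-blank, non-comment line of seedText as a seed and
    // returns an updated seeding version string (here simply the seed time).
    // lastSeedVersion is ignored: re-sending all seeds is allowed, since it is
    // always ok to send more documents rather than fewer.
    static String addSeeds(SeedSink sink, String seedText, String lastSeedVersion, long seedTime) {
        for (String line : seedText.split("\\r?\\n")) {
            String url = line.trim();
            if (url.isEmpty() || url.startsWith("#")) {
                continue; // assumed convention: blank lines and '#' comments are skipped
            }
            sink.addSeedDocument(url);
        }
        return Long.toString(seedTime);
    }

    public static void main(String[] args) {
        List<String> queued = new ArrayList<>();
        String version = addSeeds(queued::add,
            "http://www.example.com/\n# a comment line\n", null, System.currentTimeMillis());
        System.out.println(queued + " -> " + version);
    }
}
```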
-
processDocuments
public void processDocuments(java.lang.String[] documentIdentifiers, org.apache.manifoldcf.crawler.interfaces.IExistingVersions statuses, org.apache.manifoldcf.core.interfaces.Specification spec, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int jobMode, boolean usesDefaultAuthority) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Process a set of documents. This is the method that should cause each document to be fetched, processed, and the results either added to the queue of documents for the current job, and/or entered into the incremental ingestion manager. The document specification allows this class to filter what is done based on the job. The connector will be connected before this method can be called.
- Specified by:
processDocuments in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
processDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
documentIdentifiers - is the set of document identifiers to process.
statuses - are the currently-stored document versions for each document in the set of document identifiers passed in above.
activities - is the interface this method should use to queue up new document references and ingest documents.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
usesDefaultAuthority - will be true only if the authority in use for these documents is the default one.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
-
loginAndFetch
protected void loginAndFetch(WebcrawlerConnector.FetchStatus fetchStatus, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, java.lang.String documentIdentifier, SequenceCredentials sessionCredential, java.lang.String globalSequenceEvent) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
-
processDocument
protected void processDocument(org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, java.lang.String documentIdentifier, java.lang.String versionString, boolean indexDocument, java.util.Map<java.lang.String,java.util.Set<java.lang.String>> metaHash, java.lang.String[] acls, WebcrawlerConnector.DocumentURLFilter filter) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
-
extractContentType
protected static java.lang.String extractContentType(java.lang.String contentType)
-
extractEncoding
protected static java.lang.String extractEncoding(java.lang.String contentType)
-
extractMimeType
protected static java.lang.String extractMimeType(java.lang.String contentType)
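The three extract* methods above are listed without descriptions. As a rough illustration of what splitting a Content-Type header value can look like, here is a minimal sketch. The class `ContentTypeParts` and its methods `mimeType` and `charset` are hypothetical names invented for this example; the lower-casing and quote handling are assumptions, not the connector's documented behavior.

```java
import java.util.Locale;

// Hypothetical helper, not part of WebcrawlerConnector.
final class ContentTypeParts {
  private ContentTypeParts() {}

  // Returns the media type before the first ';', lower-cased, or null if absent.
  static String mimeType(String contentType) {
    if (contentType == null) {
      return null;
    }
    int semi = contentType.indexOf(';');
    String mime = (semi == -1 ? contentType : contentType.substring(0, semi)).trim();
    return mime.isEmpty() ? null : mime.toLowerCase(Locale.ROOT);
  }

  // Returns the value of the "charset" parameter, or null if there is none.
  static String charset(String contentType) {
    if (contentType == null) {
      return null;
    }
    String[] parts = contentType.split(";");
    // parts[0] is the media type; parameters start at index 1.
    for (int i = 1; i < parts.length; i++) {
      String param = parts[i].trim();
      int eq = param.indexOf('=');
      if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
        String value = param.substring(eq + 1).trim();
        // Strip optional surrounding double quotes.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
          value = value.substring(1, value.length() - 1);
        }
        return value.isEmpty() ? null : value;
      }
    }
    return null;
  }
}
```

For a header value such as `text/html; charset=UTF-8`, this sketch yields `text/html` as the media type and `UTF-8` as the character set.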
-
handleIOException
protected static void handleIOException(java.io.IOException e, java.lang.String context) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
-
getMaxDocumentRequest
public int getMaxDocumentRequest()
Get the maximum number of documents to amalgamate together into one batch, for this connector.
- Specified by:
getMaxDocumentRequest in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
getMaxDocumentRequest in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Returns:
- the maximum number. 0 indicates "unlimited".
-
outputConfigurationHeader
public void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.util.List<java.lang.String> tabsArray) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Output the configuration header section. This method is called in the head section of the connector's configuration page. Its purpose is to add the required tabs to the list, and to output any javascript methods that might be needed by the configuration editing HTML.
- Specified by:
outputConfigurationHeader in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
outputConfigurationHeader in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
outputConfigurationBody
public void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.lang.String tabName) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Output the configuration body section. This method is called in the body section of the connector's configuration page. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is "editconnection".
- Specified by:
outputConfigurationBody in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
outputConfigurationBody in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabName - is the current tab name.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
processConfigurationPost
public java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a configuration post. This method is called at the start of the connector's configuration page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the configuration parameters accordingly. The name of the posted form is "editconnection".
- Specified by:
processConfigurationPost in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
processConfigurationPost in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the local thread context.
variableContext - is the set of variables available from the post, including binary file post information.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
- Returns:
- null if all is well, or a string error message if there is an error that should prevent saving of the connection (and cause a redirection to an error page).
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
viewConfiguration
public void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
View configuration. This method is called in the body section of the connector's view configuration page. Its purpose is to present the connection information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.
- Specified by:
viewConfiguration in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
viewConfiguration in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
outputSpecificationHeader
public void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber, java.util.List<java.lang.String> tabsArray) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Output the specification header section. This method is called in the head section of a job page which has selected a repository connection of the current type. Its purpose is to add the required tabs to the list, and to output any javascript methods that might be needed by the job editing HTML. The connector will be connected before this method can be called.
- Specified by:
outputSpecificationHeader in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
outputSpecificationHeader in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
out - is the output to which any HTML should be sent.
locale - is the locale the output is preferred to be in.
ds - is the current document specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
outputSpecificationBody
public void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber, int actualSequenceNumber, java.lang.String tabName) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Output the specification body section. This method is called in the body section of a job page which has selected a repository connection of the current type. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is always "editjob". The connector will be connected before this method can be called.
- Specified by:
outputSpecificationBody in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
outputSpecificationBody in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
out - is the output to which any HTML should be sent.
locale - is the locale the output is preferred to be in.
ds - is the current document specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
actualSequenceNumber - is the connection within the job that has currently been selected.
tabName - is the current tab name. (actualSequenceNumber, tabName) form a unique tuple within the job.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
processSpecificationPost
public java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a specification post. This method is called at the start of a job's edit or view page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the document specification accordingly. The name of the posted form is always "editjob". The connector will be connected before this method can be called.
- Specified by:
processSpecificationPost in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
processSpecificationPost in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
variableContext - contains the post data, including binary file-upload information.
locale - is the locale the output is preferred to be in.
ds - is the current document specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
- Returns:
- null if all is well, or a string error message if there is an error that should prevent saving of the job (and cause a redirection to an error page).
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
viewSpecification
public void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
View specification. This method is called in the body section of a job's view page. Its purpose is to present the document specification information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags. The connector will be connected before this method can be called.
- Specified by:
viewSpecification in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
viewSpecification in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
out - is the output to which any HTML should be sent.
locale - is the locale the output is preferred to be in.
ds - is the current document specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
makeSessionLoginEventName
protected java.lang.String makeSessionLoginEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities, java.lang.String sequenceKey)
Calculate the event name for session login.
-
makeDNSEventName
protected java.lang.String makeDNSEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities, java.lang.String hostNameKey)
Calculate the event name for DNS access.
-
lookupIPAddress
protected int lookupIPAddress(java.lang.String documentIdentifier, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, java.lang.String hostName, long currentTime, java.lang.StringBuilder ipAddressBuffer) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Look up an IP address given a non-canonical host name.
- Returns:
- appropriate status.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
-
makeRobotsKey
protected static java.lang.String makeRobotsKey(java.lang.String protocol, java.lang.String hostName, int port)
Construct the robots key for a host. This is used to look up robots info in the database, and to form the corresponding event name.
-
makeRobotsEventName
protected java.lang.String makeRobotsEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity versionActivities, java.lang.String robotsKey)
Construct a name for the global web-connector robots event.
-
checkFetchAllowed
protected int checkFetchAllowed(java.lang.String documentIdentifier, java.lang.String protocol, java.lang.String hostIPAddress, int port, PageCredentials credential, org.apache.manifoldcf.connectorcommon.interfaces.IKeystoreManager trustStore, java.lang.String hostName, java.lang.String[] binNames, long currentTime, java.lang.String pathString, org.apache.manifoldcf.crawler.interfaces.IProcessActivity versionActivities, int connectionLimit, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Check robots to see if fetch is allowed.
- Returns:
- appropriate result status code.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
-
makeDocumentIdentifier
protected java.lang.String makeDocumentIdentifier(java.lang.String parentIdentifier, java.lang.String rawURL, WebcrawlerConnector.DocumentURLFilter filter, org.apache.manifoldcf.crawler.interfaces.IHistoryActivity activities) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Convert an absolute or relative URL to a document identifier. This may involve several steps at some point, but right now it does NOT involve converting the host name to a canonical host name. (Doing so would destroy the ability of virtually hosted sites to do the right thing, since the original host name would be lost.) Thus, we do the conversion to IP address right before we actually fetch the document.
- Parameters:
parentIdentifier - the identifier of the document in which the raw url was found, or null if none.
rawURL - the starting, un-normalized, un-canonicalized URL.
filter - the filter object, used to remove unmatching URLs.
- Returns:
- the canonical URL (the document identifier), or null if the url was illegal.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
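The conversion described for makeDocumentIdentifier (resolve a raw URL against its parent, keep the host name as written, return null for an illegal URL) can be sketched with java.net.URI alone. This is only an illustration under stated assumptions: the class `UrlIdentifierSketch` and method `toIdentifier` are hypothetical, the filter and history-activity arguments are omitted, and the restriction to http and https is an assumption made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of WebcrawlerConnector.
final class UrlIdentifierSketch {
  private UrlIdentifierSketch() {}

  // Resolves rawURL against parentIdentifier (which may be null) and returns
  // the normalized absolute URL, or null if the result is not a legal web URL.
  static String toIdentifier(String parentIdentifier, String rawURL) {
    if (rawURL == null) {
      return null;
    }
    try {
      URI raw = new URI(rawURL.trim());
      URI resolved = (parentIdentifier == null)
          ? raw
          : new URI(parentIdentifier).resolve(raw);
      // Remove "." and ".." path segments; the host name is left untouched.
      resolved = resolved.normalize();
      String scheme = resolved.getScheme();
      if (scheme == null || resolved.getHost() == null) {
        return null; // relative or opaque URI: no usable document identifier
      }
      scheme = scheme.toLowerCase(Locale.ROOT);
      if (!scheme.equals("http") && !scheme.equals("https")) {
        return null; // assumption: only web schemes are crawlable
      }
      return resolved.toString();
    } catch (URISyntaxException e) {
      return null; // illegal URL
    }
  }
}
```

Because the host name is never replaced by an address here, two virtual hosts that share one IP address still produce distinct identifiers, which is the property the description above is protecting.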
-
doCanonicalization
protected java.lang.String doCanonicalization(WebcrawlerConnector.DocumentURLFilter filter, WebURL url) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.net.URISyntaxException
Canonicalize a URL. If the URL cannot be canonicalized (and is illegal), return null.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.net.URISyntaxException
-
isContentInteresting
protected boolean isContentInteresting(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities, java.lang.String documentIdentifier, int response, java.lang.String contentType) throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption, org.apache.manifoldcf.core.interfaces.ManifoldCFException
Code to check if data is interesting, based on response code and content type.
- Throws:
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
documentIdentifiertoFileName
protected java.lang.String documentIdentifiertoFileName(java.lang.String documentIdentifier) throws java.net.URISyntaxException
Convert a document identifier to a filename.
- Parameters:
documentIdentifier -
- Throws:
java.net.URISyntaxException
-
findRedirectionURI
protected java.lang.String findRedirectionURI(java.lang.String currentURI) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Find a redirection URI, if it exists.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
findHTMLForm
protected FormData findHTMLForm(java.lang.String currentURI, LoginParameters lp) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Find matching HTML form data, if present. Return null if not.- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
findPreferredRedirectionURI
protected java.lang.String findPreferredRedirectionURI(java.lang.String currentURI, LoginParameters lp) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Find a preferred redirection URI, if it exists.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
findSpecifiedContent
protected java.lang.String findSpecifiedContent(java.lang.String currentURI, LoginParameters lp) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Find existence of specific content on the page (never finds a URL).
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
findHTMLLinkURI
protected java.lang.String findHTMLLinkURI(java.lang.String currentURI, LoginParameters lp) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Find HTML link URI, if present, making sure specified preference is matched.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
extractLinks
protected boolean extractLinks(java.lang.String documentIdentifier, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, WebcrawlerConnector.DocumentURLFilter filter) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Code to extract links from an already-fetched document.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
-
handleRedirects
protected void handleRedirects(java.lang.String documentURI, IRedirectionHandler handler) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Handle extracting the redirect link from a redirect response.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
handleXML
protected void handleXML(java.lang.String documentURI, IXMLHandler handler) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Handle document references from XML. Right now we only understand RSS.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
-
handleHTML
protected void handleHTML(java.lang.String documentURI, IHTMLHandler handler) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Handle document references from HTML.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
isDocumentText
protected boolean isDocumentText(java.lang.String documentURI) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Is the document text, as far as we can tell?
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
isText
protected static boolean isText(byte[] beginChunk, int chunkLength)
Test to see if a document is text or not. The first n bytes are passed in, and this code returns "true" if it thinks they represent text. The code has been lifted algorithmically from products/Sharecrawler/Fingerprinter.pas, which was based on "perldoc -f -T".
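The heuristic documented for Perl's -T file test treats a chunk as binary if it contains a zero byte, or if more than a third of its bytes are "strange" (unusual control codes or bytes with the high bit set). A minimal sketch of that heuristic follows. The class `TextSniffer` and its method names are hypothetical, and the exact byte classes are assumptions; the connector's own isText, isStrange, and isWhiteSpace may draw the lines differently (for example, by recognizing UTF-8 sequences).

```java
// Hypothetical helper illustrating the "perldoc -f -T" style heuristic.
final class TextSniffer {
  private TextSniffer() {}

  // Tab, LF, VT, FF, CR, and space count as whitespace (assumption).
  static boolean isWhitespaceByte(byte x) {
    return x == 0x09 || x == 0x0a || x == 0x0b || x == 0x0c || x == 0x0d || x == 0x20;
  }

  // Control codes other than whitespace, DEL, and high-bit bytes count as odd.
  static boolean isOddByte(byte x) {
    int v = x & 0xff;
    return (v < 0x20 || v == 0x7f || v >= 0x80) && !isWhitespaceByte(x);
  }

  // Returns true if the first chunkLength bytes look like text.
  static boolean looksLikeText(byte[] beginChunk, int chunkLength) {
    if (chunkLength == 0) {
      return true; // an empty chunk is treated as text in this sketch
    }
    int odd = 0;
    for (int i = 0; i < chunkLength; i++) {
      byte b = beginChunk[i];
      if (b == 0) {
        return false; // any zero byte marks the chunk as binary
      }
      if (isOddByte(b)) {
        odd++;
      }
    }
    // Binary if more than one third of the bytes are odd.
    return odd * 3 <= chunkLength;
  }
}
```

Only the leading chunk is inspected, so the check stays cheap even for large documents; the cost is that a file with a textual header and a binary body would be classified as text.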
-
isStrange
protected static boolean isStrange(byte x)
Check if a byte is not a typical ASCII or UTF-8 character.
-
isWhiteSpace
protected static boolean isWhiteSpace(byte x)
Check if a byte is a whitespace character.
-
stringToArray
protected static java.util.List<java.lang.String> stringToArray(java.lang.String input)
Read a string as a sequence of individual expressions, urls, etc.
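One plausible reading of "a sequence of individual expressions" is one entry per line. The sketch below takes that reading; the class `LineListSketch`, the method `parse`, and the decision to skip blank lines and lines starting with `#` are all assumptions for illustration, not documented behavior of stringToArray.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of WebcrawlerConnector.
final class LineListSketch {
  private LineListSketch() {}

  // Splits the input into trimmed, non-empty, non-comment lines.
  static List<String> parse(String input) {
    List<String> entries = new ArrayList<>();
    if (input == null) {
      return entries;
    }
    for (String line : input.split("\\r?\\n")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue; // assumption: blank lines and '#' comments carry no entry
      }
      entries.add(trimmed);
    }
    return entries;
  }
}
```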
-
compileList
protected static void compileList(java.util.List<java.util.regex.Pattern> output, java.util.List<java.lang.String> input) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Compile all regexp entries in the passed in list, and add them to the output list.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
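Compiling a list of expressions into an output list, as compileList is described as doing, might look like the following sketch. The class `PatternListSketch` and method `compileAll` are hypothetical, and an unchecked IllegalArgumentException stands in for ManifoldCFException so the example has no ManifoldCF dependency.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of WebcrawlerConnector.
final class PatternListSketch {
  private PatternListSketch() {}

  // Compiles every expression in input and appends the result to output.
  static void compileAll(List<Pattern> output, List<String> input) {
    for (String expression : input) {
      try {
        output.add(Pattern.compile(expression));
      } catch (PatternSyntaxException e) {
        // Stand-in for the ManifoldCFException the real method declares.
        throw new IllegalArgumentException(
            "Illegal regular expression '" + expression + "': " + e.getMessage(), e);
      }
    }
  }
}
```

Compiling once and reusing the Pattern objects avoids re-parsing each expression for every URL that is tested against the list.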
-
getPageCredential
protected PageCredentials getPageCredential(java.lang.String documentIdentifier)
Get the page credentials for a given document identifier (URL)
-
getSequenceCredential
protected SequenceCredentials getSequenceCredential(java.lang.String documentIdentifier)
Get the sequence credentials for a given document identifier (URL)
-
getTrustStore
protected org.apache.manifoldcf.connectorcommon.interfaces.IKeystoreManager getTrustStore(java.lang.String documentIdentifier) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Get the trust store for a given document identifier (URL)
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
getAcls
protected static java.lang.String[] getAcls(org.apache.manifoldcf.core.interfaces.Specification spec)
Grab the forced ACLs out of the document specification.
- Parameters:
spec - is the document specification.
- Returns:
- the acls.
-
findExcludedHeaders
protected static java.util.Set<java.lang.String> findExcludedHeaders(org.apache.manifoldcf.core.interfaces.Specification spec) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Read a document specification to get a set of excluded headers.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
calculateDocumentEvents
protected java.lang.String[] calculateDocumentEvents(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities, java.lang.String documentIdentifier)
Calculate events that should be associated with a document.
-
getFinalURL
public static java.lang.String getFinalURL(java.lang.String url) throws java.io.IOException, java.net.URISyntaxException
If the initial URL is permanently or temporarily redirected (code 301 or 302), the method returns the destination URL.
- Parameters:
url - the initial URL
- Returns:
- the url after redirection
- Throws:
java.io.IOException
java.net.URISyntaxException
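The decision at the heart of getFinalURL (follow the Location header only for a 301 or 302 response) can be shown without any network access. The sketch below is an assumption-laden illustration: the class `RedirectSketch` and method `destination` are hypothetical, the status code and Location value are passed in by the caller rather than fetched, and resolving a relative Location against the original URL is a choice made for this example.

```java
import java.net.URI;

// Hypothetical helper, not part of WebcrawlerConnector.
final class RedirectSketch {
  private RedirectSketch() {}

  // Returns the redirect target for a 301/302 response, otherwise the original URL.
  static String destination(String url, int statusCode, String locationHeader) {
    boolean redirected = statusCode == 301 || statusCode == 302;
    if (!redirected || locationHeader == null || locationHeader.trim().isEmpty()) {
      return url;
    }
    // A Location value may be relative, so resolve it against the request URL.
    return URI.create(url).resolve(locationHeader.trim()).toString();
  }
}
```

In a real fetch, the status code and header would come from an HTTP client configured not to follow redirects automatically, so that the caller can inspect them.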
-
-