Examples of dkpro.similarity.algorithms.sspace.util.LatentSemanticAnalysis

Package dkpro.similarity.algorithms.sspace.util

dkpro.similarity.algorithms.sspace.util.LatentSemanticAnalysis
Landauer, T. K., and Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

See the Wikipedia page on Latent Semantic Analysis for an executive summary.

LSA first processes documents into a word-document matrix where each unique word is assigned a row in the matrix, and each column represents a document. The values of this matrix correspond to the number of times the row's word occurs in the column's document. After the matrix has been built, the Singular Value Decomposition (SVD) is used to reduce the dimensionality of the original word-document matrix, denoted as A. The SVD is a way of factoring any matrix A into three matrices U Σ V^T such that Σ is a diagonal matrix containing the singular values of A. The singular values in Σ are ordered from largest to smallest, so the leading values account for the most variance in the values of A. The original matrix may be approximated by recomputing the matrix with only the k largest singular values and setting the rest to 0. The approximated matrix Â = U_k Σ_k V_k^T is the least-squares best-fit rank-k approximation of A. LSA reduces the dimensions by keeping only the first k dimensions from the row vectors of U. These vectors form the semantic space of the words.
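The first step described above, building the word-document count matrix, can be sketched in plain Java. This is a minimal illustration and not the class's actual implementation: the class name TermDocumentMatrixSketch, the whitespace tokenizer, and the lowercasing are all assumptions made for the example, and the SVD step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the library.
final class TermDocumentMatrixSketch {

    /**
     * Builds a count matrix with one row per unique word and one column per
     * document. The word-to-row assignment is written into rowIndex.
     */
    static int[][] build(List<String> documents, Map<String, Integer> rowIndex) {
        List<String[]> tokenized = new ArrayList<>();
        for (String doc : documents) {
            // Assumption: lowercase, whitespace-separated tokens.
            String[] tokens = doc.toLowerCase().trim().split("\\s+");
            tokenized.add(tokens);
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                // Assign the next free row to each previously unseen word.
                rowIndex.putIfAbsent(token, rowIndex.size());
            }
        }

        int[][] counts = new int[rowIndex.size()][documents.size()];
        for (int col = 0; col < tokenized.size(); col++) {
            for (String token : tokenized.get(col)) {
                if (token.isEmpty()) {
                    continue;
                }
                // Cell (row, col) = occurrences of the row's word in document col.
                counts[rowIndex.get(token)][col]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> rows = new LinkedHashMap<>();
        int[][] a = build(Arrays.asList("the cat sat", "the dog sat on the mat"), rows);
        for (Map.Entry<String, Integer> e : rows.entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(a[e.getValue()]));
        }
    }
}
```

In this sketch the row for "the" holds the counts 1 and 2, one per document. A matrix like this is what the preprocessing transform and the SVD would then operate on.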

This class offers configurable preprocessing and dimensionality reduction through three parameters.

Property: {@value #MATRIX_TRANSFORM_PROPERTY} Default: {@link LogEntropyTransform}: This variable sets the preprocessing algorithm to use on the term-document matrix prior to computing the SVD. The property value should be the fully qualified name of a class that implements {@link Transform}. The class should be public, not abstract, and should provide a public no-arg constructor.
Property: {@value #LSA_DIMENSIONS_PROPERTY} Default: {@code 300}: The number of dimensions to use for the semantic space. This value is used as input to the SVD.
Property: {@value #LSA_SVD_ALGORITHM_PROPERTY} Default: {@link edu.ucla.sspace.matrix.SVD.Algorithm#ANY}: This property sets the specific SVD algorithm that LSA will use to reduce the dimensionality of the word-document matrix. In general, users should not need to set this property, as the default behavior will choose the fastest algorithm available on the system.
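Properties such as these are typically passed in a java.util.Properties object, with the documented default applied when a key is absent. The sketch below shows that lookup pattern for the dimensions setting only. The key string "example.lsa.dimensions" and the class LsaPropertiesSketch are placeholders invented for the example; in real code the key would be the value of LSA_DIMENSIONS_PROPERTY, and how the class itself parses the value is not shown in this document.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; not part of the library.
final class LsaPropertiesSketch {

    // Placeholder key; real code would use LatentSemanticAnalysis.LSA_DIMENSIONS_PROPERTY.
    static final String DIMENSIONS_KEY = "example.lsa.dimensions";

    // The documented default number of dimensions.
    static final int DEFAULT_DIMENSIONS = 300;

    /** Returns the configured number of dimensions, or the default if unset. */
    static int resolveDimensions(Properties props) {
        String value = props.getProperty(DIMENSIONS_KEY);
        return value == null ? DEFAULT_DIMENSIONS : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(resolveDimensions(props)); // key absent: default applies

        props.setProperty(DIMENSIONS_KEY, Integer.toString(50));
        System.out.println(resolveDimensions(props)); // key present: configured value
    }
}
```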

This class is thread-safe for concurrent calls of {@link #processDocument(BufferedReader) processDocument}. Once {@link #processSpace(Properties) processSpace} has been called, no further calls to {@code processDocument} should be made. This implementation does not support access to the semantic vectors until after {@code processSpace} has been called.

@see Transform
@see SVD
@author David Jurgens
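The calling order this contract implies is: feed documents concurrently, wait for all of them to finish, then finalize the space once. The sketch below shows that two-phase pattern with an ExecutorService. StandInSpace is a hypothetical stand-in written for this example so the code runs on its own; it only counts documents, and its choice to throw on a late processDocument call is an assumption, not documented behavior of the real class.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class; not part of the library.
final class TwoPhaseUsageSketch {

    /** Hypothetical stand-in for a semantic space; it only counts documents. */
    static final class StandInSpace {
        private final AtomicInteger documents = new AtomicInteger();
        private volatile boolean processed = false;

        void processDocument(String text) {
            if (processed) {
                // Assumption: the stand-in rejects late documents outright.
                throw new IllegalStateException("processSpace was already called");
            }
            documents.incrementAndGet();
        }

        void processSpace() {
            processed = true;
        }

        int documentCount() {
            return documents.get();
        }
    }

    /** Phase 1: concurrent processDocument calls. Phase 2: one processSpace call. */
    static int run(List<String> docs, int threads) {
        StandInSpace space = new StandInSpace();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            pool.submit(() -> space.processDocument(doc));
        }
        pool.shutdown();
        try {
            // Wait until every document has been processed before finalizing.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("document processing timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing documents", e);
        }
        space.processSpace();
        return space.documentCount();
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("first doc", "second doc", "third doc"), 2));
    }
}
```

The key design point is the barrier between the phases: processSpace is only called after awaitTermination confirms that no processDocument call is still running.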

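  /**
   * Builds an LSA semantic space from all ".txt" files found recursively under the input directory.
   *
   * @throws IOException if the semantic space cannot be created or a document cannot be read
   */
  public static SemanticSpace createSemanticSpace(File aInputDir, int aMaxDimensions)
    throws IOException
  {
    LatentSemanticAnalysis sspace = new LatentSemanticAnalysis();

    // Collect every .txt file below the input directory (recursive).
    Collection<File> documents = FileUtils.listFiles(aInputDir, new String[] { "txt" }, true);

    for (File document : documents) {
      // try-with-resources closes the reader even if processDocument throws.
      try (BufferedReader reader = new BufferedReader(new FileReader(document))) {
        sspace.processDocument(reader);
      }
    }

    // Fall back to 300 dimensions when no positive maximum is given,
    // and never request more dimensions than there are documents.
    int dimensions = Math.min(documents.size(), aMaxDimensions <= 0 ? 300 : aMaxDimensions);

    Properties props = new Properties();
    props.setProperty(LatentSemanticAnalysis.LSA_DIMENSIONS_PROPERTY, Integer.toString(dimensions));
    sspace.processSpace(props);

    return sspace;
  }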
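The dimension calculation in createSemanticSpace is easy to misread, so here it is isolated as a small runnable sketch. The class and method names (DimensionClampSketch, effectiveDimensions) are made up for the example; the arithmetic is copied from the snippet above. A likely reason for the cap is that a word-document matrix cannot have a rank higher than its number of columns, that is, documents, though the original code does not state its motivation.

```java
// Hypothetical helper for illustration only; mirrors the Math.min expression above.
final class DimensionClampSketch {

    static final int DEFAULT_DIMENSIONS = 300;

    /**
     * Uses the default when maxDimensions is zero or negative, then caps the
     * result at the number of documents.
     */
    static int effectiveDimensions(int documentCount, int maxDimensions) {
        int requested = maxDimensions <= 0 ? DEFAULT_DIMENSIONS : maxDimensions;
        return Math.min(documentCount, requested);
    }

    public static void main(String[] args) {
        System.out.println(effectiveDimensions(1000, 0));   // prints 300
        System.out.println(effectiveDimensions(50, 0));     // prints 50
        System.out.println(effectiveDimensions(1000, 100)); // prints 100
        System.out.println(effectiveDimensions(50, 100));   // prints 50
    }
}
```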

    super.initialize(context);

    nrOfDocuments = 0;

    try {
      sspace = new LatentSemanticAnalysis();
    }
    catch (IOException e) {
      throw new ResourceInitializationException(e);
    }
  }
