Examples of cascading.scheme.hadoop.TextDelimited

cascading.scheme.hadoop.TextDelimited
Class TextDelimited is a sub-class of {@link TextLine}. It provides direct support for delimited text files, like TAB (\t) or COMMA (,) delimited files. It also optionally allows for quoted values.
TextDelimited may also be used to skip the "header" in a file, where the header is defined as the very first line in every input file. That is, if the byte offset of the current line from the input is zero (0), that line will be skipped.
If the sink/source {@code fields} is set to either {@link Fields#ALL} or {@link Fields#UNKNOWN} and {@code skipHeader} or {@code hasHeader} is {@code true}, the field names will be retrieved from the header of the file and used during planning. The header will be parsed with the same rules as the body of the file.
By default headers are not skipped.
TextDelimited may also be used to write a "header" in a file. The field names for the header are taken directly from the declared fields. If the declared fields are {@link Fields#ALL} or {@link Fields#UNKNOWN}, the resolved field names will be used, if any.
By default headers are not written.
If {@code hasHeader} is set to {@code true} on a constructor, both {@code skipHeader} and {@code writeHeader} will be set to {@code true}.
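The header rules above can be sketched in plain Java, independent of Cascading. This is only an illustration of the described behavior: {@code HeaderSketch} and its methods are hypothetical names, not part of the library, and the sketch assumes a non-empty list of lines and a literal (non-regex) delimiter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Cascading: mimics the header semantics described above.
class HeaderSketch {

    // Field names are read from the first line, split with the same delimiter as the body.
    // Assumes the list holds at least one line.
    static List<String> fieldNames(List<String> lines, String delimiter) {
        return Arrays.asList(lines.get(0).split(Pattern.quote(delimiter), -1));
    }

    // Body rows are every line after the first; the header line itself is skipped.
    static List<String[]> body(List<String> lines, String delimiter) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {
            rows.add(lines.get(i).split(Pattern.quote(delimiter), -1));
        }
        return rows;
    }
}
```

In the real class the same effect is requested through the {@code hasHeader} or {@code skipHeader} constructor arguments rather than by calling helpers like these.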
By default this {@link cascading.scheme.Scheme} is both {@code strict} and {@code safe}.
Strict means that if a line of text does not parse into the expected number of fields, this class will throw a {@link TapException}. If strict is {@code false}, a {@link Tuple} will be returned with {@code null} values for the missing fields.
Safe means that if a field cannot be coerced into an expected type, a {@code null} will be used for the value. If safe is {@code false}, a {@link TapException} will be thrown.
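The difference between the two flags can be illustrated with a small plain-Java sketch. It does not use Cascading: {@code StrictSafeSketch} is a hypothetical class, {@code IllegalStateException} stands in for {@link TapException}, and truncating over-long lines in non-strict mode is a simplification of this sketch rather than documented behavior.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Cascading: mimics the strict and safe semantics described above.
class StrictSafeSketch {

    // strict == true: a wrong field count raises an exception.
    // strict == false: missing fields are padded with null.
    static String[] split(String line, String delimiter, int expected, boolean strict) {
        String[] parts = line.split(Pattern.quote(delimiter), -1);
        if (parts.length == expected) {
            return parts;
        }
        if (strict) {
            // Stand-in for the TapException thrown by the real class.
            throw new IllegalStateException("expected " + expected + " fields, found " + parts.length);
        }
        // Pads with null when short; truncates when long (a simplification of this sketch).
        return Arrays.copyOf(parts, expected);
    }

    // safe == true: a value that cannot be coerced becomes null.
    // safe == false: the coercion failure is rethrown.
    static Integer toInt(String value, boolean safe) {
        try {
            return Integer.valueOf(value);
        } catch (NumberFormatException e) {
            if (safe) {
                return null;
            }
            throw new IllegalStateException("cannot coerce: " + value, e);
        }
    }
}
```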
Also by default, {@code quote} strings are not searched for, which improves processing speed. If a file is COMMA delimited but may have COMMAs in a value, the whole value should be surrounded by the quote string, typically double quotes ({@literal "}).
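To show why quoting matters, here is a minimal plain-Java sketch of quote-aware splitting. It is not how {@link DelimitedParser} is implemented (the text further down indicates the real parser relies on regular expressions); {@code QuoteSketch} is a hypothetical name, and the sketch deliberately ignores escaped quotes inside a value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Cascading: splits on a delimiter while honoring a quote character.
class QuoteSketch {

    static List<String> split(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == quote) {
                // Toggle quoted state; the quote character itself is dropped.
                inQuotes = !inQuotes;
            } else if (c == delimiter && !inQuotes) {
                // An unquoted delimiter ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                // Everything else, including quoted delimiters, belongs to the field.
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }
}
```

Without the quote handling, the value {@code Smith, Jane} in a line such as {@code 1,"Smith, Jane",NY} would be split into two fields.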
Note all empty fields in a line will be returned as {@code null} unless coerced into a new type.
This Scheme may source/sink {@link Fields#ALL}. When {@link Fields#ALL} is given on the constructor, the new instance will automatically default to strict == false, as the number of fields parsed is arbitrary or unknown. A type array may not be given either, so all values will be returned as Strings.
By default, all text is encoded/decoded as UTF-8. This can be changed via the {@code charsetName} constructor argument.
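The effect of choosing the wrong charset can be seen with the JDK alone. The sketch below is an illustration only: {@code CharsetSketch} is a hypothetical class, and it decodes raw bytes by charset name in the same spirit as a {@code charsetName} argument, without touching Cascading.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Cascading: decodes raw bytes using a named charset.
class CharsetSketch {

    // Charset.forName throws an unchecked exception if the name is illegal or unsupported.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }
}
```

For example, bytes written as ISO-8859-1 that contain an accented character will not round-trip if they are decoded as UTF-8, so files in such an encoding need the matching charset name.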
To override field and line parsing behaviors, sub-class {@link DelimitedParser} or provide a {@link cascading.scheme.util.FieldTypeResolver} implementation.
Note that there should be no expectation that TextDelimited, or specifically {@link DelimitedParser}, can handle all delimited and quoted combinations reliably. Attempting to do so would impair its performance and maintainability.
Further, corrupted files are not supported. Corrupted files may result in exceptions or could cause edge cases in the underlying Java regular expression engine.
A large part of Cascading was designed to help users cleanse data. Thus the recommendation is to create Flows that are responsible for cleansing large data-sets when faced with the problem.
DelimitedParser may be sub-classed and extended if necessary. @see TextLine

        CascadeConnector cascadeConnector = new CascadeConnector(cfg);
        cascadeConnector.connect(flows).complete();
    }


    private Tap sourceTap() {
        return new Hfs(new TextDelimited(new Fields("id", "name", "url", "picture", "ts")), INPUT);
    }

        props.put(ConfigurationOptions.ES_INPUT_JSON, "true");
        return props;
    }


    private Tap sourceTap() {
        return new Hfs(new TextDelimited(new Fields("line")), INPUT);
    }

        groupByItemIDPipe.getStepConfigDef().setProperty("itemIndexPath", itemIndexPath.toString());
        // for these matrices the group by key is the id from the Mahout row key
        groupByItemIDPipe.getStepConfigDef().setProperty("rowIndexPath", iDIndexPath.toString());
        groupByItemIDPipe.getStepConfigDef().setProperty("joining", "true");


        Tap groupedOutputSink = new Hfs(new TextDelimited(true,","), groupedCSVOutputPath.toString());


        FlowDef flowDef = new FlowDef()
            .setName("group-DRMs-by-key")
            .addSource(lhs, dRM1Source)
            .addSource(rhs, dRM2Source)

        //pass these to the output function so the strings from the indexes can be written instead of the
        //binary values of the Keys and Vectors in the DRMs
        dRM1.getStepConfigDef().setProperty("itemIndexPath", itemIndexPath.toString());
        dRM1.getStepConfigDef().setProperty("rowIndexPath", iDIndexPath.toString());
        dRM1.getStepConfigDef().setProperty("joining", "false");
        Tap outputSink = new Hfs(new TextDelimited(true,","), cSVOutputPath.toString());


        FlowDef flowDef = new FlowDef()
            .setName("convert-to-CSV")
            .addSource(dRM1, dRM1Source)
            .addTailSink(dRM1, outputSink);

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    FlowConnector flowConnector = new HadoopFlowConnector( properties );


    // create SOURCE taps, and read from local file system if inputs are not URLs
    Tap tweetTap = makeTap( tweetPath, new TextDelimited( true, "\t" ) );


    Tap stopTap = makeTap( stopWords, new TextDelimited( new Fields( "stop" ), true, "\t" ) );


    // create SINK taps, replacing previous output if needed
    Tap tokenTap = new Hfs( new TextDelimited( true, "\t" ), tokenPath, SinkMode.REPLACE );
    Tap similarityTap = new Hfs( new TextDelimited( true, "\t" ), similarityPath, SinkMode.REPLACE );


    /*
    flow part #1
    generate a bipartite map of (uid, token), while filtering out stop-words
    */

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );


    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );


    Fields stop = new Fields( "stop" );
    Tap stopTap = new Hfs( new TextDelimited( stop, true, "\t" ), stopPath );
    Tap tfidfTap = new Hfs( new TextDelimited( true, "\t" ), tfidfPath );


    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );


    // create source and sink taps
    Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
    Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );


    // handle command line options
    OptionParser optParser = new OptionParser();
    optParser.accepts( "pmml" ).withRequiredArg();

    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );


    // create taps for sources, sinks, traps
    Tap gisTap = new Hfs( new TextLine( new Fields( "line" ) ), gisPath );
    Tap metaTreeTap = new Hfs( new TextDelimited( true, "\t" ), metaTreePath );
    Tap metaRoadTap = new Hfs( new TextDelimited( true, "\t" ), metaRoadPath );
    Tap logsTap = new Hfs( new TextDelimited( true, "," ), logsPath );
    Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );
    Tap tsvTap = new Hfs( new TextDelimited( true, "\t" ), tsvPath );
    Tap treeTap = new Hfs( new TextDelimited( true, "\t" ), treePath );
    Tap roadTap = new Hfs( new TextDelimited( true, "\t" ), roadPath );
    Tap parkTap = new Hfs( new TextDelimited( true, "\t" ), parkPath );
    Tap shadeTap = new Hfs( new TextDelimited( true, "\t" ), shadePath );
    Tap recoTap = new Hfs( new TextDelimited( true, "\t" ), recoPath );


    // specify a regex to split the GIS dump into known fields
    Fields fieldDeclaration = new Fields( "blurb", "misc", "geo", "kind" );
    String regex =  "^\"(.*)\",\"(.*)\",\"(.*)\",\"(.*)\"$";
    int[] gisGroups = { 1, 2, 3, 4 };


  public static FlowDef
  createFlowDef( String docPath, String wcPath )
   {
    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );


    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

    }


  @Override
  public Tap getDelimitedFile( Fields fields, boolean hasHeader, String delimiter, String quote, Class[] types, String filename, SinkMode mode )
    {
    return new Hfs( new TextDelimited( fields, hasHeader, delimiter, quote, types ), safeFileName( filename ), mode );
    }

Related Classes of cascading.scheme.hadoop.TextDelimited

cascading.lingual.platform.hadoop.HadoopDefaultFactory

cascading.lingual.platform.hadoop2.Hadoop2MR1DefaultFactory

cascading.pattern.Main

cascading.platform.hadoop.BaseHadoopPlatform

cascading.scheme.util.DelimitedParser

cascading.tap.hadoop.HadoopTapPlatformTest

cascading.tap.hadoop.Hfs

cascading.tuple.Fields

cascading.tuple.Tuple

cascading.tuple.TupleEntry
